Abstract

This paper addresses the problem of learning a task 
from demonstration. We adopt the framework of in- 
verse reinforcement learning, where tasks are repre- 
sented in the form of a reward function. Our contribu- 
tion is a novel active learning algorithm that enables 
the learning agent to query the expert for more infor- 
mative demonstrations, thus leading to more sample- 
efficient learning. For this novel algorithm (General- 
ized Binary Search for Inverse Reinforcement Learn- 
ing, or GBS-IRL), we provide a theoretical bound on 
sample complexity and illustrate its applicability on 
several different tasks. To our knowledge, GBS-IRL 
is the first active IRL algorithm with provable sam- 
ple complexity bounds. We also discuss our method in 
light of other existing methods in the literature and its 
general applicability in multi-class classification prob- 
lems. Finally, motivated by recent work on learning 
from demonstration in robots, we also discuss how dif- 
ferent forms of human feedback can be integrated in a 
transparent manner in our learning framework. 



1 Introduction 

Social learning, where an agent uses information provided by
other individuals to refine existing skills or acquire new ones,
is likely to become one of the primary forms of programming
complex intelligent systems (Schaal, 1999).



Paralleling the social learning ability of human infants,
an artificial system can retrieve a large amount of
task-related information by observing and/or interacting
with other agents engaged in relevant activities. For
example, the behavior of an expert can bias an agent's
exploration of the environment, improve its knowledge
of the world, or even lead it to reproduce parts of the
observed behavior (Melo et al., 2007).



In this paper we are particularly interested in learning
from demonstration. This particular form of social
learning is commonly associated with imitation and
emulation behaviors in nature (Lopes et al., 2009a). It
is also possible to find numerous successful examples of
robot systems that learn from demonstration (see the
survey works of Argall et al., 2009; Lopes et al., 2010).
In the simplest form of interaction, the demonstration
may consist of examples of the right action to take in
different situations.

In our approach to learning from demonstration we
adopt the formalism of inverse reinforcement learning
(IRL), where the task is represented as a reward
function (Ng and Russell, 2000). From this representation,
the agent can then construct its own policy and
solve the target task. However, unlike many systems
that learn from demonstration, in this paper we
propose to combine ideas from active learning (Settles,
2009) with IRL, in order to reduce the data requirements
during learning. In fact, many agents able
to learn from demonstration are designed to process
batches of data, typically acquired before any actual
learning takes place. Such a data acquisition process fails
to take advantage of any information the learner may
acquire in early stages of learning to guide the acquisition
of new data. Several recent works have proposed
that more interactive learning may actually lead to
improved learning performance.

We adopt a Bayesian approach to IRL, following
Ramachandran and Amir (2007), and allow the learning
agent to actively select and query the expert for the
desired behavior at the most informative situations.
We contribute a theoretical analysis of our algorithm
that provides a bound on the sample complexity of our
learning approach and illustrate our method in several
problems from the IRL literature.

Finally, even if learning from demonstration is the
main focus of our paper and an important skill for
intelligent agents interacting with human users, the
ability to accommodate different forms of feedback is also
useful. In fact, there are situations where the user may
be unable to properly demonstrate the intended behavior
and, instead, prefers to describe a task in terms
of a reward function, as is customary in reinforcement
learning (Sutton and Barto, 1998). As an example,
suppose that the user wants the agent to learn how to
navigate a complex maze. The user may experience
difficulties in navigating the maze herself and may,
instead, allow the agent to explore the maze and reward
it for exiting the maze.

Additionally, recent studies on the behavior of naive
users when instructing agents (namely, robots) showed
that the feedback provided by humans is often ambiguous
and does not map in any obvious manner to either
a reward function or a policy (Thomaz and Breazeal,
2008; Cakmak and Thomaz, 2010). For instance, it
was observed that human users tend to provide learning
agents with anticipatory or guidance rewards, a
situation seldom considered in reinforcement learning
(Thomaz and Breazeal, 2008). This study concludes
that robust agents able to successfully learn from human
users should be flexible enough to accommodate different
forms of feedback from the user.

In order to address the issues above, we discuss how
other forms of expert feedback (beyond policy information)
may be integrated in a seamless manner in our
IRL framework, so that the learner is able to efficiently
recover the target task. In particular, we show how
to combine both policy and reward information in our
learning algorithm. Our approach thus provides a useful
bridge between reinforcement learning (or learning
by trial and error) and imitation learning (or learning
from demonstration), a line of work seldom explored
in the literature (see, however, the works of Knox and
Stone, 2010, 2011, and the discussion in Section 1.1).



The paper is organized as follows. In the remainder
of this section, we provide an overview of related
work on social learning, particularly on learning from
demonstration. We also discuss relevant research in
IRL and active learning, and discuss our contributions
in light of existing work. Section 2 revisits core
background concepts, introducing the notation used
throughout the paper. Section 3 introduces our active
IRL algorithm, GBS-IRL, and provides a theoretical
analysis of its sample complexity. Section 4 illustrates
the application of GBS-IRL in several problems of
different complexity, providing an empirical comparison
with other methods in the literature. Finally, Section 5
concludes the paper, discussing directions for future
research.



1.1 Related Work 

There is extensive literature reporting research on
intelligent agents that learn from expert advice. Many
examples feature robotic agents that learn simple tasks
from different forms of human feedback. Examples include
the robot Leonardo, which is able to learn new tasks
by observing changes induced in the world (as perceived
by the robot) by a human demonstrating the
target task (Breazeal et al., 2004). During learning,
Leonardo provides additional feedback on its current
understanding of the task that the human user can
then use to provide additional information. We refer to
the survey works of Argall et al. (2009) and Lopes et al.
(2010) for a comprehensive discussion on learning from
demonstration.

In this paper, as already mentioned, we adopt the
inverse reinforcement learning (IRL) formalism introduced
in the seminal paper by Ng and Russell (2000).
One appealing aspect of the IRL approach to learning
from demonstration is that the learner is not just
"mimicking" the observed actions. Instead, the learner
infers the purpose behind the observed behavior and
sets such purpose as its goal. IRL also enables the
learner to accommodate differences between itself
and the demonstrator (Lopes et al., 2009a).



The appealing features discussed above have led several
researchers to address learning from demonstration
from an IRL perspective. Abbeel and Ng (2004)
explored inverse reinforcement learning in a context of
apprenticeship learning, where the purpose of the learning
agent is to replicate the behavior of the demonstrator,
but the agent is only able to observe a sequence of states
experienced during task execution. The IRL formalism
allows the learner to reason about which tasks could
lead the demonstrator to visit the observed states and
infer how to replicate the inferred behavior. Syed et al.
(Syed et al., 2008; Syed and Schapire, 2008) have further
explored this line of reasoning from a game-theoretic
perspective, and proposed algorithms to learn from
demonstration with provable guarantees on the performance
of the learner.

Ramachandran and Amir (2007) introduced
Bayesian inverse reinforcement learning (BIRL),
where the IRL problem is cast as a Bayesian inference
problem. Given a prior distribution over possible
target tasks, the algorithm uses the demonstration
by the expert as evidence to compute the posterior
distribution over tasks and identify the target
task. Unfortunately, the Markov chain Monte Carlo
(MCMC) algorithm used to approximate the posterior
distribution is computationally expensive, as it
requires extensive sampling of the space of possible
rewards. To avoid such complexity, several subsequent
works have departed from the BIRL formulation
and instead determine the task that maximizes the
likelihood of the observed demonstration (Lopes et al.,
2009b; Babes et al., 2011).

The aforementioned maximum likelihood approaches
of Lopes et al. (2009b) and Babes et al. (2011) take
advantage of the underlying IRL problem structure
and derive simple gradient-based algorithms to determine
the maximum likelihood task representation.
Two closely related works are the maximum entropy
approach of Ziebart et al. (2008) and the gradient IRL
approach of Neu and Szepesvári (2007). While the former
selects the task representation that maximizes the
likelihood of the observed expert behavior, under the
maximum entropy distribution, the latter explores a
gradient-based approach to IRL in which the task
representation is selected so as to induce a behavior as
similar as possible to the expert behavior.

Finally, Ross and Bagnell (2010) propose a learning
algorithm that reduces imitation learning to a classification
problem. The classifier prescribes the best action
to take in each possible situation that the learner
can encounter, and is successively improved by enriching
the data-set used to train the classifier.

All the above works are designed to learn from whatever
data is available to them at learning time, data that
is typically acquired before any actual learning takes
place. Such a data acquisition process fails to take
advantage of the information that the learner acquires
in early stages of learning to guide the acquisition of
new, more informative data. Active learning aims to
reduce the data requirements of learning algorithms by
actively selecting potentially informative samples, in
contrast with random sampling from a predefined
distribution (Settles, 2009). In the case of learning from
demonstration, active learning can be used to reduce
the number of situations that the expert/human user
is required to demonstrate. Instead, the learner should
proactively ask the expert to demonstrate the desired
behavior at the most informative situations.

Confidence-based autonomy (CBA), proposed by
Chernova and Veloso (2009), also enables a robot to
learn a task from a human user by building a mapping
between situations that the robot has encountered and
the adequate actions. This work already incorporates
a mechanism that enables the learner to ask the expert
for the right action when it encounters a situation
in which it is less confident about the correct behavior.
The system also allows the human user to provide
corrective feedback as the robot executes the learned
task.¹ The querying strategy in CBA can be classified
both as stream-based and as mellow (see discussions
in the survey works of Settles, 2009; Dasgupta, 2011).
Stream-based, since the learner is presented with a
stream of samples (in the case of CBA, samples correspond
to possible situations) and only asks for the
labels (i.e., correct actions) of those samples it feels
uncertain about. Mellow, since it does not seek highly
informative samples, but queries any sample that is at
all informative.

In the IRL literature, active learning was first explored
in a preliminary version of this paper (Lopes
et al., 2009b). In this early version, the learner actively
queries the expert for the correct action in those states
where it is most uncertain about the correct behavior.
Unlike CBA, this active sampling approach is aggressive
and uses membership query synthesis. Aggressive,
since it actively selects highly informative samples.
And, unlike CBA, it can select ("synthesize") queries
from the whole input space. Judah et al. (2011) propose
a very similar approach, the imitation query-by-committee
(IQBC) algorithm, which differs from
the previous active sampling approach only in that
the learner is able to accommodate the notion of "bad
states", i.e., states to be avoided during task execution.



Cohn et al. (2011) propose another closely related approach
that, however, uses a different criterion to select
which situations to query. EMG-AQS (Expected
Myopic Gain Action Querying Strategy) queries the expert
for the correct action in those states where the expected
gain of information is potentially larger. Unfortunately,
as discussed by Cohn et al. (2011), the determination
of the expected gain of information requires
extensive computation, rendering EMG-AQS computationally
costly. In a different line of work, Ross et al.
(2011) and Judah et al. (2012) address imitation learning
using a no-regret framework, and propose algorithms
for direct imitation learning with provable bounds on
the regret. Finally, Melo and Lopes (2010) use active
learning in a metric approach to learning from demonstration.

Our approach in this paper is a modified version of
our original active sampling algorithm (Lopes et al.,
2009b). We start from the generalized binary search
(GBS) algorithm of Nowak (2011) and adapt it to the
IRL setting. To this purpose, we cast IRL as a (multi-class)
classification problem and extend the GBS algorithm
of Nowak (2011) to this multi-class setting.
We analyze the sample complexity of our GBS-IRL approach,
thus providing the first active IRL algorithm
with provable bounds on sample complexity. Also, to
the best of our knowledge, GBS-IRL is the first aggressive
active learning algorithm for non-separable,
multi-class data (Dasgupta, 2011).

1 Related ideas are further explored in the dogged
architecture of Grollman and Jenkins (2007).



We conclude this discussion of related work by pointing
out that all the above works describe systems that learn
from human feedback. However, other forms of expert
advice have also been explored in the agent learning
literature. Price and Boutilier (1999, 2003) have explored
how a learning agent can improve its performance by
observing other similar agents, in what could be seen
as "implicit" imitation learning. In these works, the
demonstrator is, for all purposes, oblivious to the fact
that its actions are being observed and learned from.
Instead, the learner observes the behavior of the other
agents and extracts information that may be useful for
its own learning (for example, it may extract useful
information about the world dynamics).

In a more general setting, Barto and Rosenstein
(2004) discuss how different forms of supervisory
information can be integrated in a reinforcement learning
architecture to improve learning. Finally, Knox
and Stone (2009, 2010) introduce the TAMER paradigm,
which enables a reinforcement learning agent to use human
feedback (in addition to its reinforcement signal)
to guide its learning process.

1.2 Contributions 

Our contributions can be summarized as follows: 

• A novel active IRL algorithm, GBS-IRL, that ex- 
tends generalized binary search to a multi-class 
setting in the context of IRL. 

• The sample complexity analysis of GBS-IRL. We
establish, under suitable conditions, the exponential
convergence of our active learning method, as
a function of the number of samples. As pointed
out earlier, to our knowledge ours is the first
work providing sample complexity bounds on active
IRL. Several experimental results confirm the
good sample performance of our approach.

• A general discussion on how different forms of expert
information (namely action and reward information)
can be integrated in our IRL setting.
We illustrate the applicability of our ideas in several
simple scenarios and discuss the applicability
of these different sources of information in light of
our empirical results.

From a broader perspective, our analysis is a non-trivial
extension of the results of Nowak (2011) to a
multiclass setting, with applications not only to IRL
but to any multiclass classification problem.

2 Background and Notation 

This section introduces background material on 
Markov decision processes and the Bayesian inverse re- 
inforcement learning formalism, upon which our con- 
tributions are developed. 

2.1 Markov Decision Processes 

A Markov decision problem (MDP) describes a sequential
decision problem in which an agent must choose
the sequence of actions that maximizes some reward-based
optimization criterion. Formally, an MDP is
a tuple $\mathcal{M} = (\mathcal{X}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{X}$ represents the
state space, $\mathcal{A}$ the finite action space, $P$ represents the
transition probabilities, $r$ is the reward function and
$\gamma$ is a positive discount factor. $P(y \mid x, a)$ denotes the
probability of transitioning from state $x$ to state $y$ when
action $a$ is taken, i.e.,

$$P(y \mid x, a) = \mathbb{P}[X_{t+1} = y \mid X_t = x, A_t = a],$$

where each $X_t$, $t = 1, \ldots$, is a random variable (r.v.)
denoting the state of the process at time-step $t$ and $A_t$
is a r.v. denoting the action of the agent at time-step
$t$.

A policy is a mapping $\pi : \mathcal{X} \times \mathcal{A} \to [0, 1]$, where
$\pi(x, a)$ is the probability of choosing action $a \in \mathcal{A}$ in
state $x \in \mathcal{X}$. Formally,

$$\pi(x, a) = \mathbb{P}[A_t = a \mid X_t = x].$$

It is possible to associate with any such policy $\pi$ a
value function,

$$V^\pi(x) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(X_t, A_t) \mid X_0 = x\right],$$

where the expectation is taken over possible trajectories
of $\{X_t\}$ induced by policy $\pi$. The purpose of
the agent is then to select a policy $\pi^*$ such that

$$V^{\pi^*}(x) \geq V^\pi(x)$$

for any policy $\pi$ and all $x \in \mathcal{X}$. Any such policy is an optimal policy
for that MDP and the corresponding value function is
denoted by $V^*$.
Given any policy $\pi$, the following recursion holds

$$V^\pi(x) = r_\pi(x) + \gamma \sum_{y \in \mathcal{X}} P_\pi(x, y) V^\pi(y),$$

where $P_\pi(x, y) = \sum_{a \in \mathcal{A}} \pi(x, a) P_a(x, y)$ and $r_\pi(x) =
\sum_{a \in \mathcal{A}} \pi(x, a) r(x, a)$, with $P_a(x, y) = P(y \mid x, a)$. For the particular case of the
optimal policy $\pi^*$, the above recursion becomes

$$V^*(x) = \max_{a \in \mathcal{A}} \left[ r(x, a) + \gamma \sum_{y \in \mathcal{X}} P_a(x, y) V^*(y) \right].$$



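We also define the Q-function associated with a policy
$\pi$ as

$$Q^\pi(x, a) = r(x, a) + \gamma \sum_{y \in \mathcal{X}} P_a(x, y) V^\pi(y),$$

which, in the case of the optimal policy, becomes

$$Q^*(x, a) = r(x, a) + \gamma \sum_{y \in \mathcal{X}} P_a(x, y) V^*(y)
           = r(x, a) + \gamma \sum_{y \in \mathcal{X}} P_a(x, y) \max_{b \in \mathcal{A}} Q^*(y, b). \qquad (1)$$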
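As an illustration of the optimal Q-function recursion above, the following is a minimal sketch of value iteration for a small finite MDP. The array layout (`P[a, x, y]`, `r[x, a]`), the function name `value_iteration`, the stopping tolerance and the two-state toy problem are all assumptions made for this example and are not part of the paper; the sketch also assumes a discount factor strictly below 1.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-10, max_iter=10_000):
    """Iterate the optimal Q-function recursion until it stops changing.

    P[a, x, y] is the probability of moving from x to y under action a,
    r[x, a] is the reward, gamma is the discount factor (assumed < 1 here).
    Returns (V, Q), approximations of the optimal value and Q-functions.
    """
    Q = np.zeros_like(r, dtype=float)
    for _ in range(max_iter):
        V = Q.max(axis=1)  # V(y) = max_b Q(y, b)
        # Q(x, a) = r(x, a) + gamma * sum_y P_a(x, y) V(y)
        Q_new = r + gamma * np.einsum("axy,y->xa", P, V)
        converged = np.max(np.abs(Q_new - Q)) < tol
        Q = Q_new
        if converged:
            break
    return Q.max(axis=1), Q

# Hypothetical two-state MDP: action 0 stays, action 1 switches state;
# the agent is rewarded for being in state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
r = np.array([[0.0, 0.0],
              [1.0, 1.0]])
V_star, Q_star = value_iteration(P, r, gamma=0.9)
```

In this toy problem the greedy choice is to switch in state 0 and to stay in state 1, which is what the rows of `Q_star` reflect.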



2.2 Bayesian Inverse Reinforcement Learning

As seen above, an MDP describes a sequential decision 
making problem in which an agent must choose its ac- 
tions so as to maximize the total discounted reward. 
In this sense, the reward function in an MDP encodes 
the task of the agent. 

Inverse reinforcement learning (IRL) deals with the
problem of recovering the task representation (i.e., the
reward function) given a demonstration of the behavior
to be learned (i.e., the desired policy). In this paper
we adopt the formulation in Ramachandran and Amir
(2007), where IRL is cast as a Bayesian inference problem,
in which the agent is provided with samples of the
desired policy, $\pi^*$, and it must identify the target reward
function, $r^*$, from a general set of possible functions
$\mathcal{R}$. Prior to the observation of any policy sample
and given any measurable set $R \subset \mathcal{R}$, the initial belief
that $r^* \in R$ is encoded in the form of a probability
density function $p$ defined on $\mathcal{R}$, i.e.,

$$\mathbb{P}[r^* \in R] = \int_R p(r)\, dr.$$



As discussed by Ramachandran and Amir (2007) and Lopes et al.
(2009b), it is generally impractical to explicitly
maintain and update $p$. Instead, as in the aforementioned
works, we work with a finite (but potentially
very large) sample of $\mathcal{R}$ obtained according to $p$. We
denote this sample by $\mathcal{R}_p$, and associate with each element
$r_k \in \mathcal{R}_p$ a prior probability $p_0(r_k)$ given by

$$p_0(r_k) = \frac{p(r_k)}{\sum_{r_i \in \mathcal{R}_p} p(r_i)}.$$

Associated with each reward $r_k \in \mathcal{R}_p$ and each $x \in \mathcal{X}$,
we define the set of greedy actions at $x$ with respect to
$r_k$ as

$$\mathcal{A}_k(x) = \{a \in \mathcal{A} \mid a \in \operatorname{argmax}_{b \in \mathcal{A}} Q_k(x, b)\},$$

where $Q_k$ is the Q-function associated with the optimal
policy for $r_k$, as defined in (1). From the sets
$\mathcal{A}_k(x)$, $x \in \mathcal{X}$, we define the greedy policy with respect
to $r_k$ as the mapping $\pi_k : \mathcal{X} \times \mathcal{A} \to [0, 1]$ given by

$$\pi_k(x, a) = \frac{\mathbb{I}_{\mathcal{A}_k(x)}(a)}{|\mathcal{A}_k(x)|},$$

where we write $\mathbb{I}_U$ to denote the indicator function for
a set $U$. In other words, for each $x \in \mathcal{X}$, the greedy
policy with respect to $r_k$ is defined as a probability
distribution that is uniform in $\mathcal{A}_k(x)$ and zero in its
complement. We assume, without loss of generality,
that for any $r_i, r_j \in \mathcal{R}_p$ with $i \neq j$, $\mathcal{A}_i(x) \neq \mathcal{A}_j(x)$ for at least one
$x \in \mathcal{X}$.²

For any $r_k \in \mathcal{R}_p$, consider a perturbed version of $\pi_k$
where, for each $x \in \mathcal{X}$, action $a \in \mathcal{A}$ is selected with
probability

$$\hat{\pi}_k(x, a) = \begin{cases} \beta_k(x) & \text{if } a \notin \mathcal{A}_k(x) \\ \gamma_k(x) & \text{if } a \in \mathcal{A}_k(x), \end{cases} \qquad (2)$$

where, typically, $\beta_k(x) < \gamma_k(x)$.³ We note that both
$\pi_k$ and the uniform policy can be obtained as limits of
$\hat{\pi}_k$, by setting $\beta_k(x) = 0$ or $\beta_k(x) = \gamma_k(x)$, respectively.
Following the Bayesian IRL paradigm, the likelihood of
observing an action $a$ by the demonstrator at state $x$,
given that the target task is $r_k$, is now given by

$$\ell_k(x, a) = \mathbb{P}[A_t = a \mid X_t = x, r^* = r_k] = \hat{\pi}_k(x, a). \qquad (3)$$
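The greedy action sets and the perturbed policy used as the demonstration likelihood can be sketched as follows. This is only an illustration: the function names, the tie tolerance `atol`, and the example Q-values are hypothetical. The sketch additionally assumes that the probability assigned to greedy actions is fixed by requiring each row of the perturbed policy to sum to one, which the text implies but does not state in this general form.

```python
import numpy as np

def greedy_actions(Q, atol=1e-9):
    """Boolean mask G[x, a], True iff action a attains max_b Q[x, b]."""
    return np.isclose(Q, Q.max(axis=1, keepdims=True), rtol=0.0, atol=atol)

def perturbed_policy(G, beta):
    """Perturbed greedy policy: beta on non-greedy actions, gamma_k on greedy ones.

    beta may be a scalar or an array with one entry per state.
    gamma_k is chosen so that each row sums to one (assumption of this sketch).
    """
    n_actions = G.shape[1]
    n_greedy = G.sum(axis=1)
    beta = np.broadcast_to(np.asarray(beta, dtype=float), n_greedy.shape)
    gamma_k = (1.0 - (n_actions - n_greedy) * beta) / n_greedy
    return np.where(G, gamma_k[:, None], beta[:, None])

# Hypothetical Q-values for one candidate reward (2 states, 2 actions).
Q_k = np.array([[8.1, 9.0],
                [10.0, 9.1]])
G_k = greedy_actions(Q_k)
likelihood_k = perturbed_policy(G_k, beta=0.1)  # likelihood_k[x, a] plays the role of (3)
```

Setting `beta=0.0` recovers the greedy policy, while `beta=1/n_actions` gives the uniform policy, matching the two limits mentioned in the text.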



2 This assumption merely ensures that there are no redundant
rewards in $\mathcal{R}_p$. If two such rewards $r_i, r_j$ existed in $\mathcal{R}_p$, we
could safely discard one of the two, say $r_j$, setting $p_0(r_i) \leftarrow
p_0(r_i) + p_0(r_j)$.

3 Policy $\hat{\pi}_k$ assigns the same probability, $\gamma_k(x)$, to all actions
that, for the particular reward $r_k$, are optimal in state $x$. Similarly,
it assigns the same probability, $\beta_k(x)$, to all corresponding
sub-optimal actions. This perturbed version of $\pi_k$ is convenient
both for its simplicity and because it facilitates our analysis.
However, other versions of perturbed policies have been considered
in the IRL literature; see, for example, the works of
Ramachandran and Amir (2007); Neu and Szepesvári (2007); Lopes
et al. (2009b).
Given a history of $t$ (independent) observations,
$\mathcal{F}_t = \{(x_\tau, a_\tau), \tau = 0, \ldots, t\}$, the likelihood in (3) can
now be used in a standard Bayesian update to compute,
for every $r_k \in \mathcal{R}_p$, the posterior probability

$$p_t(r_k) = \frac{\mathbb{P}[r^* = r_k]\, \mathbb{P}[\mathcal{F}_t \mid r^* = r_k]}{Z}
          = \frac{p_0(r_k) \prod_{\tau=0}^{t} \ell_k(x_\tau, a_\tau)}{Z},$$

where $Z$ is a normalization constant.
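The Bayesian update over the finite reward sample can be illustrated with a short sketch. The array layout (`likelihoods[k, x, a]` holding the likelihood of action `a` in state `x` under the `k`-th reward), the function name, and the numbers in the example are assumptions of this illustration. The computation is done in log space for numerical stability, and it assumes all priors and likelihoods are strictly positive.

```python
import numpy as np

def posterior(prior, likelihoods, history):
    """Posterior over candidate rewards after observing (state, action) pairs.

    prior:       shape (K,), prior probability of each candidate reward
    likelihoods: shape (K, X, A), likelihood of action a in state x under reward k
    history:     iterable of (x, a) index pairs, assumed independent
    """
    log_post = np.log(np.asarray(prior, dtype=float))
    for x, a in history:
        log_post = log_post + np.log(likelihoods[:, x, a])
    log_post -= log_post.max()          # avoid underflow before exponentiating
    post = np.exp(log_post)
    return post / post.sum()            # division by the normalization constant Z

# Hypothetical example: two candidate rewards, one state, two actions.
lik = np.array([[[0.9, 0.1]],
                [[0.1, 0.9]]])
post = posterior([0.5, 0.5], lik, history=[(0, 0), (0, 0)])
```

After two demonstrations of action 0, most of the posterior mass moves to the first candidate reward, under which action 0 is the greedy one.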

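For the particular case of $r^*$ we write the corresponding
perturbed policy as

$$\hat{\pi}^*(x, a) = \begin{cases} \beta^*(x) & \text{if } a \notin \mathcal{A}^*(x) \\ \gamma^*(x) & \text{if } a \in \mathcal{A}^*(x), \end{cases}$$

and denote the maximum noise level as the positive
constant $\alpha$ defined as

$$\alpha = \sup_{x \in \mathcal{X}} \beta^*(x).$$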

3 Multiclass Active Learning for Inverse Reinforcement Learning

In this section we introduce our active learning ap- 
proach to IRL. 

3.1 Preliminaries 

To develop an active learning algorithm for this setting,
we convert the problem of determining $r^*$ into an equivalent
classification problem. This mostly amounts to
rewriting the Bayesian IRL problem from Section 2
using a different notation.

We define the hypothesis space $\mathcal{H}$ as follows. For
every $r_k \in \mathcal{R}_p$, the $k$th hypothesis $h_k : \mathcal{X} \to \{-1, 1\}^{|\mathcal{A}|}$
is defined as the function

$$h_k(x, a) = 2\, \mathbb{I}_{\mathcal{A}_k(x)}(a) - 1,$$

where we write $h_k(x, a)$ to denote the $a$th component
of $h_k(x)$. Intuitively, $h_k(x)$ identifies (with a value of
1) the greedy actions in $x$ with respect to $r_k$, assigning
a value of $-1$ to all other actions. We take $\mathcal{H}$ as the set
of all such functions $h_k$. Note that, since every reward
prescribes at least one optimal action per state, it holds
that for every $h \in \mathcal{H}$ and every $x \in \mathcal{X}$ there is at least
one $a \in \mathcal{A}$ such that $h(x, a) = 1$. We write $h^*$ to denote
the target hypothesis, corresponding to $r^*$.

As before, given a hypothesis $h \in \mathcal{H}$, we define the
set of greedy actions at $x$ according to $h$ as

$$\mathcal{A}_h(x) = \{a \in \mathcal{A} \mid h(x, a) = 1\}.$$

For an indexed set of samples, $\{(x_\lambda, a_\lambda), \lambda \in \Lambda\}$, we
write $h_\lambda$ to denote $h(x_\lambda, a_\lambda)$, when the index set is
clear from the context.

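The prior distribution $p_0$ over $\mathcal{R}_p$ induces an equivalent
distribution over $\mathcal{H}$, which, with a slight abuse of notation,
we also denote by $p_0$, and is such that $p_0(h_k) = p_0(r_k)$. We let
the history of observations up to time-step $t$ be

$$\mathcal{F}_t = \{(x_\tau, a_\tau), \tau = 0, \ldots, t\},$$

and $\beta_h$ and $\gamma_h$ be the estimates of $\beta^*$ and $\gamma^*$ associated
with the hypothesis $h$. Then, the distribution over $\mathcal{H}$
after observing $\mathcal{F}_t$ can be updated using Bayes rule as

$$\begin{aligned}
p_t(h) &\triangleq \mathbb{P}[h^* = h \mid \mathcal{F}_t] \\
       &\propto \mathbb{P}[a_t \mid x_t, h^* = h, \mathcal{F}_{t-1}]\, \mathbb{P}[h^* = h \mid \mathcal{F}_{t-1}] \\
       &= \mathbb{P}[a_t \mid x_t, h^* = h]\, \mathbb{P}[h^* = h \mid \mathcal{F}_{t-1}] \\
       &\approx \gamma_h(x_t)^{(1+h_t)/2}\, \beta_h(x_t)^{(1-h_t)/2}\, p_{t-1}(h),
\end{aligned} \qquad (4)$$

where $h_t$ denotes $h(x_t, a_t)$ and we assume, for all $x \in \mathcal{X}$,

$$|\mathcal{A}_h(x)|\, \gamma_h(x) < |\mathcal{A}^*(x)|\, \gamma^*(x), \qquad (5)$$

and $p_t(h)$ is normalized so that $\sum_{h \in \mathcal{H}} p_t(h) = 1$. Note
that, in (4), we accommodate the possibility of having
access (for each hypothesis) to inaccurate estimates
$\beta_h$ and $\gamma_h$ of $\beta^*$ and $\gamma^*$, respectively.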

We consider a partition of the state-space $\mathcal{X}$ into
a disjoint family of $N$ sets, $\Xi = \{X_1, \ldots, X_N\}$, such
that all hypotheses $h \in \mathcal{H}$ are constant in each set
$X_i$, $i = 1, \ldots, N$. In other words, any two states
$x, y$ lying in the same $X_i$ are indistinguishable, since
$h(x, a) = h(y, a)$ for all $a \in \mathcal{A}$ and all $h \in \mathcal{H}$. This
means that our hypothesis space $\mathcal{H}$ induces an equivalence
relation in $\mathcal{X}$ in which two elements $x, y \in \mathcal{X}$ are
equivalent if $\{x, y\} \subset X_i$. We write $[x]_i$ to denote the
(any) representative of the set $X_i$.⁴
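The hypothesis encoding and the induced partition of the state space can be sketched in a few lines. The representation of the hypothesis space as an array `H[k, x, a]` of values in {-1, +1}, the function names, and the three-state example are assumptions of this illustration rather than the authors' implementation.

```python
import numpy as np

def hypothesis(G):
    """Map a boolean greedy-action mask G[x, a] to h(x, a) = 2 * indicator - 1."""
    return 2 * G.astype(int) - 1

def partition_states(H):
    """Group states on which every hypothesis in H[k, x, a] takes the same values.

    Returns a list of cells; each cell is a list of state indices.
    """
    cells = {}
    for x in range(H.shape[1]):
        key = H[:, x, :].tobytes()       # signature of state x across all hypotheses
        cells.setdefault(key, []).append(x)
    return list(cells.values())

# Hypothetical example: 2 hypotheses, 3 states, 2 actions.
G0 = np.array([[True, False], [True, False], [False, True]])
G1 = np.array([[True, False], [True, False], [True, False]])
H = np.stack([hypothesis(G0), hypothesis(G1)])
cells = partition_states(H)
representatives = [cell[0] for cell in cells]
```

Here states 0 and 1 are indistinguishable for both hypotheses and fall in the same cell, so only one of them ever needs to be considered as a query.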



The following definitions extend those of Nowak
(2011).

Definition 1 ($k$-neighborhood). Two sets $X_i, X_j \in \Xi$
are said to be $k$-neighbors if the set

$$\{h \in \mathcal{H} \mid \mathcal{A}_h([x]_i) \neq \mathcal{A}_h([x]_j)\}$$

has, at most, $k$ elements, i.e., if there are $k$ or fewer
hypotheses in $\mathcal{H}$ that output different optimal actions
in $X_i$ and $X_j$.



4 While this partition is, perhaps, of little relevance in problems
with a small state-space $\mathcal{X}$, it is central in problems with
large (or infinite) state-space, since the state to be queried has
to be selected from a set of $N$ alternatives, instead of the (much
larger) set of $|\mathcal{X}|$ alternatives.
Definition 2. The pair $(\mathcal{X}, \mathcal{H})$ is $k$-neighborly if,
for any two sets $X_i, X_j \in \Xi$, there is a sequence
$\{X_{\ell_0}, \ldots, X_{\ell_n}\} \subset \Xi$ such that

• $X_{\ell_0} = X_i$ and $X_{\ell_n} = X_j$;

• For any $m$, $X_{\ell_m}$ and $X_{\ell_{m+1}}$ are $k$-neighbors.

The notion of $k$-neighborhood structures the state-space
$\mathcal{X}$ in terms of the hypothesis space $\mathcal{H}$, and this
structure can be exploited for active learning purposes.
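The two definitions above can be checked mechanically on a finite problem: count the hypotheses that disagree between two cells, then test whether the graph whose edges join k-neighbors is connected. The sketch below assumes hypotheses are stored as an array `H[k, x, a]` of values in {-1, +1} and that `reps` lists one representative state per cell; these names and the example are hypothetical.

```python
import numpy as np
from itertools import combinations

def disagreement_count(H, xi, xj):
    """Number of hypotheses whose greedy action sets differ between states xi and xj."""
    return int(np.any(H[:, xi, :] != H[:, xj, :], axis=1).sum())

def is_k_neighborly(H, reps, k):
    """True if any two cells are linked by a chain of k-neighbor cells.

    reps must be a non-empty list with one representative state per cell.
    """
    n = len(reps)
    adjacent = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if disagreement_count(H, reps[i], reps[j]) <= k:
            adjacent[i].add(j)
            adjacent[j].add(i)
    # Depth-first search from cell 0 to test connectivity.
    seen, stack = {0}, [0]
    while stack:
        for j in adjacent[stack.pop()] - seen:
            seen.add(j)
            stack.append(j)
    return len(seen) == n

# Hypothetical example: 2 hypotheses, 3 states, 2 actions.
H = np.array([[[1, -1], [1, -1], [-1, 1]],
              [[1, -1], [1, -1], [1, -1]]])
reps = [0, 2]   # states 0 and 1 share a cell, so state 0 represents both
```

In this example exactly one hypothesis separates the two cells, so the pair is 1-neighborly but not 0-neighborly.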

3.2 Active IRL using GBS 

In defining our active IRL algorithm, we first consider
a simplified setting in which the following assumption
holds. We postpone to Section 3.3 the discussion of the
more general case.

Assumption 1. For every $h \in \mathcal{H}$ and every $x \in \mathcal{X}$,
$|\mathcal{A}_h(x)| = 1$.

In other words, we focus on the case where all hypotheses
considered prescribe a unique optimal action
per state. A single optimal action per state implies
that the noise model can be simplified. In particular,
the noise model can now be constant across hypotheses,
since all $h \in \mathcal{H}$ prescribe the same number of optimal
actions in each state (namely, one). We denote by $\hat{\gamma}(x)$
and $\hat{\beta}(x)$ the estimates of $\gamma^*$ and $\beta^*$, respectively, and
consider a Bayesian update of the form:

$$p_t(h) = \frac{1}{Z}\, \hat{\gamma}(x_t)^{(1+h_t)/2}\, \hat{\beta}(x_t)^{(1-h_t)/2}\, p_{t-1}(h), \qquad (6)$$

with $1 - \hat{\gamma}(x) = (|\mathcal{A}| - 1)\hat{\beta}(x)$ and $Z$ an adequate normalization
constant. For this simpler case, (5) becomes

$$\hat{\beta}(x) > \beta^*(x) \quad \text{and} \quad \hat{\gamma}(x) < \gamma^*(x), \qquad (7)$$

where, as before, we overestimate the noise rate $\beta^*(x)$.
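For a given probability distribution $p$, define the
weighted prediction in $x$ as

$$W(p, x) = \max_{a \in \mathcal{A}} \sum_{h \in \mathcal{H}} p(h) h(x, a),$$

and the predicted action at $x$ as

$$A^*(p, x) = \operatorname{argmax}_{a \in \mathcal{A}} \sum_{h \in \mathcal{H}} p(h) h(x, a).$$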

We are now in a position to introduce a first version
of our active learning algorithm for inverse reinforcement
learning, which we dub Generalized Binary Search
for IRL (GBS-IRL). GBS-IRL is summarized in Algorithm 1.
This first version of the algorithm relies
critically on Assumption 1. In Section 3.3 we discuss
how Algorithm 1 can be modified to accommodate situations
in which Assumption 1 does not hold.

Algorithm 1 GBS-IRL (version 1)

Require: MDP parameters $\mathcal{M} \setminus r$
Require: Reward space $\mathcal{R}_p$
Require: Prior distribution $p_0$ over $\mathcal{R}_p$

Compute $\mathcal{H}$ from $\mathcal{R}_p$
Determine partition $\Xi = \{X_1, \ldots, X_N\}$ of $\mathcal{X}$
Set $\mathcal{F}_0 = \emptyset$
for all $t = 0, \ldots$ do
    Set $c_t = \min_{i=1,\ldots,N} W(p_t, [x]_i)$
    if there are 1-neighbor sets $X_i, X_j$ such that
            $W(p_t, [x]_i) > c_t$, $W(p_t, [x]_j) > c_t$,
            $A^*(p_t, [x]_i) \neq A^*(p_t, [x]_j)$
    then
        Sample $x_{t+1}$ from $X_i$ or $X_j$ with probability 1/2
    else
        Sample $x_{t+1}$ from the set $X_i$ that minimizes $W(p_t, [x]_i)$
    end if
    Obtain noisy response $a_{t+1}$
    Set $\mathcal{F}_{t+1} \leftarrow \mathcal{F}_t \cup \{(x_{t+1}, a_{t+1})\}$
    Update $p_{t+1}$ from $p_t$ using (6)
end for
return $\hat{h}_t = \operatorname{argmax}_{h \in \mathcal{H}} p_t(h)$
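The query loop of this first version of GBS-IRL can be sketched as below. This is an illustrative reading of the algorithm, not the authors' implementation, and it makes several assumptions: hypotheses are stored as an array `H[k, x, a]` in {-1, +1} with a single +1 per state (Assumption 1); each cell is queried at its representative state; the estimate `gamma_hat` is the same in every state; when several qualifying pairs of 1-neighbor cells exist, the first one found is used; and the expert is modeled by a hypothetical callable `oracle`.

```python
import numpy as np

def gbs_irl(H, reps, prior, oracle, gamma_hat, n_queries, rng):
    """Sketch of the GBS-IRL loop under a single optimal action per state.

    H: (K, X, A) array of +/-1; reps: one representative state per cell;
    oracle(x) returns the (possibly noisy) expert action at state x.
    Returns (index of the most probable hypothesis, posterior vector).
    """
    n_actions = H.shape[2]
    beta_hat = (1.0 - gamma_hat) / (n_actions - 1)   # 1 - gamma = (|A| - 1) * beta
    p = np.asarray(prior, dtype=float).copy()
    n = len(reps)
    for _ in range(n_queries):
        scores = np.einsum("k,kia->ia", p, H[:, reps, :])   # one row per cell
        W, A_star = scores.max(axis=1), scores.argmax(axis=1)
        c = W.min()
        # Pairs of 1-neighbor cells, both above c, with different predicted actions.
        pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
                 if W[i] > c and W[j] > c and A_star[i] != A_star[j]
                 and np.any(H[:, reps[i], :] != H[:, reps[j], :], axis=1).sum() <= 1]
        if pairs:
            i, j = pairs[0]
            cell = i if rng.random() < 0.5 else j
        else:
            cell = int(W.argmin())
        x = reps[cell]
        a = oracle(x)
        h_t = H[:, x, a]                                     # +/-1 for each hypothesis
        p = p * np.where(h_t == 1, gamma_hat, beta_hat)      # update of the form (6)
        p /= p.sum()
    return int(p.argmax()), p

# Hypothetical problem: 3 hypotheses, 3 states (each its own cell), 2 actions.
H = np.array([[[1, -1], [1, -1], [1, -1]],
              [[1, -1], [1, -1], [-1, 1]],
              [[1, -1], [-1, 1], [-1, 1]]])
expert_actions = [0, 0, 1]                                   # noise-free expert following H[1]
best, p_final = gbs_irl(H, reps=[0, 1, 2], prior=np.ones(3) / 3,
                        oracle=lambda x: expert_actions[x],
                        gamma_hat=0.9, n_queries=20,
                        rng=np.random.default_rng(0))
```

In this toy run the loop never queries state 0, on which all hypotheses agree, and concentrates the posterior on the hypothesis the expert follows.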

Our analysis of GBS-IRL relies on the following fundamental
lemma that generalizes Lemma 3 of Nowak
(2011) to multi-class settings.

Lemma 1. Let $\mathcal{H}$ denote a hypothesis space defined
over a set $\mathcal{X}$, where $(\mathcal{X}, \mathcal{H})$ is assumed $k$-neighborly.
Define the coherence parameter for $(\mathcal{X}, \mathcal{H})$ as

$$c^*(\mathcal{X}, \mathcal{H}) = \max_{a \in \mathcal{A}} \min_{\mu} \max_{h \in \mathcal{H}} \sum_{i=1}^{N} h([x]_i, a)\, \mu(X_i),$$

where $\mu$ is a probability measure over $\mathcal{X}$. Then, for
any probability distribution $p$ over $\mathcal{H}$, one of the two
statements below holds:

1. There is a set $X_i \in \Xi$ such that

$$W(p, [x]_i) \leq c^*.$$

2. There are two $k$-neighbor sets $X_i$ and $X_j$ such that

$$W(p, [x]_i) > c^*, \qquad W(p, [x]_j) > c^*, \qquad A^*(p, [x]_i) \neq A^*(p, [x]_j).$$

Proof. See Appendix A.1. □



This lemma states that, given any distribution over
the set of hypotheses, either there is a state $[x]_i$ for
which there is great uncertainty concerning the optimal
action or, alternatively, there are two $k$-neighboring
states $[x]_i$ and $[x]_j$ in which all except a few hypotheses
predict the same action, yet the predicted optimal
action is strikingly different in both states. In either
case, it is possible to select a query that is highly informative.

The coherence parameter $c^*$ is the multi-class equivalent
of the coherence parameter introduced by Nowak
(2011), and quantifies the informativeness of queries.
That $c^*$ always exists can be established by noting that
the partition of $\mathcal{X}$ is finite (since $\mathcal{H}$ is finite) and, therefore,
the minimization can be conducted exactly. On
the other hand, if $\mathcal{H}$ does not include trivial hypotheses
that are constant all over $\mathcal{X}$, it holds that $c^* < 1$.

We are now in position to establish the convergence 
properties of Algorithm [T] Let P [■] and E [•] denote 
the probability measure and corresponding expecta- 
tion governing the underlying probability over noise 
and possible algorithm randomizations in query selec- 
tion. 

Theorem 1 (Consistency of GBS-IRL). Let $\mathcal{F}_t = \{(x_\tau, a_\tau), \tau = 1, \ldots, t\}$ denote a possible history of observations obtained with GBS-IRL. If, in the update (6), $\hat\beta(x)$ and $\hat\gamma(x)$ verify (7), then

$$\lim_{t \to \infty} \mathbb{P}[\hat{h}_t \ne h^*] = 0.$$

Proof. See Appendix A.2. □



Theorem 1 establishes the consistency of active learning for multi-class classification. The proof relies on a fundamental lemma that, roughly speaking, ensures that the sequence $p_t(h^*)$ is increasing in expectation. This fundamental lemma (Lemma 2 in Appendix A.2) generalizes a related result of Nowak (2011) that, due to the consideration of multiple classes in GBS-IRL, does not apply. Our generalization requires, in particular, stronger assumptions on the noise, $\hat\beta(x)$, and implies a different rate of convergence, as will soon become apparent. It is also worth mentioning that the statement in Theorem 1 could alternatively be proved using an adaptive sub-modularity argument (again relying on Lemma 2 in Appendix A.2), using the results of Golovin and Krause (2011).

Theorem 1 ensures that, as the number of samples increases, the probability mass concentrates on the correct hypothesis $h^*$. However, it does not provide any information concerning the rate at which $\mathbb{P}[\hat{h}_t \ne h^*] \to 0$. The convergence rate for our active sampling approach is established in the following result.



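Theorem 2 (Convergence Rate of GBS-IRL). Let $\mathcal{H}$ denote our hypothesis space, defined over $\mathcal{X}$, and assume that $(\mathcal{X}, \mathcal{H})$ is 1-neighborly. If, in the update (6), $\hat\beta(x) \ge \alpha$ for all $x \in \mathcal{X}$, then

$$\mathbb{P}[\hat{h}_t \ne h^*] \le |\mathcal{H}| (1 - \lambda)^t, \qquad t = 0, \ldots \qquad (8)$$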

where X = e ■ min { , \ } < 1 and 

e = mm7 (xj —— hp [X) - . (9) 



j(x) 



Proof. See Appendix A.4. □

Theorem 2 extends Theorem 4 of Nowak (2011) to the multi-class case. However, due to the existence of multiple actions (classes), the constants obtained in the above bounds differ from those obtained in the aforementioned work (Nowak, 2011). Interestingly, for $c^*$ close to zero, the convergence rate obtained is near-optimal, exhibiting a logarithmic dependence on the dimension of the hypothesis space. In fact, we have the following straightforward corollary of Theorem 2.

Corollary 1 (Sample Complexity of GBS-IRL). Under the conditions of Theorem 2, for any given $\delta > 0$, $\mathbb{P}[\hat{h}_t = h^*] \ge 1 - \delta$ as long as

$$t \ge \frac{1}{\lambda} \log \frac{|\mathcal{H}|}{\delta}.$$
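A sample-complexity condition of this kind can be obtained directly from the bound (8). The short derivation below is a sketch that assumes only (8) and the elementary inequality $1 - \lambda \le e^{-\lambda}$; it gives a sufficient condition on $t$, not necessarily the tightest one.

```latex
\begin{align*}
\mathbb{P}[\hat{h}_t \neq h^*]
  &\le |\mathcal{H}|\,(1-\lambda)^t
     && \text{by (8)} \\
  &\le |\mathcal{H}|\, e^{-\lambda t}
     && \text{since } 1-\lambda \le e^{-\lambda} \\
  &\le \delta
     && \text{whenever } t \ge \frac{1}{\lambda}\log\frac{|\mathcal{H}|}{\delta}.
\end{align*}
```

The number of queries therefore grows only logarithmically with $|\mathcal{H}|$, which is the dependence referred to in the discussion of Theorem 2.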



To conclude this section, we note that our reduction of IRL to a standard (multi-class) classification problem implies that Algorithm 1 is not specialized in any particular way to IRL problems. In particular, it can be used in general classification problems. Additionally, the guarantees in Theorems 1 and 2 are also generally applicable to any multi-class classification problem verifying the corresponding assumptions.

3.3 Discussion and Extensions 

We now discuss the general applicability of our results from Section 3.2. In particular, we discuss two assumptions considered in Theorem 2, namely the 1-neighborly condition on $(\mathcal{X}, \mathcal{H})$ and Assumption 1. We also discuss how additional forms of expert feedback may be integrated in a seamless manner in our GBS-IRL approach, so that the learner is able to recover efficiently the target task.

1-Neighborly Assumption:

This assumption is formulated in Theorem 2. The 1-neighborly assumption states that $(\mathcal{X}, \mathcal{H})$ is 1-neighborly, meaning that it is possible to "structure" the state-space $\mathcal{X}$ in a manner that is coherent with the hypothesis space $\mathcal{H}$. To assess the validity of this assumption in general, we start by recalling that two sets $\mathcal{X}_i, \mathcal{X}_j \in \Xi$ are 1-neighbors if there is a single hypothesis $h_0 \in \mathcal{H}$ that prescribes different optimal actions in $\mathcal{X}_i$ and $\mathcal{X}_j$. Then, $(\mathcal{X}, \mathcal{H})$ is 1-neighborly if every two sets $\mathcal{X}_i, \mathcal{X}_j$ can be "connected" by a sequence of 1-neighbor sets.

In general, given a multi-class classification problem with hypothesis space $\mathcal{H}$, the 1-neighborly assumption can be investigated by verifying the connectivity of the 1-neighborhood graph induced by $\mathcal{H}$ on $\mathcal{X}$. We refer to the work of Nowak (2011) for a detailed discussion of this case, as similar arguments carry over to our multi-class extension.

In the particular case of inverse reinforcement learning, it is important to assess whether the 1-neighborly assumption is reasonable. Given a finite state-space, $\mathcal{X}$, and a finite action-space, $\mathcal{A}$, it is possible to build a total of $|\mathcal{A}|^{|\mathcal{X}|}$ different hypotheses.¹ As shown in the work of Melo et al (2010), for any such hypothesis it is always possible to build a non-degenerate reward function that yields such hypothesis as the optimal policy. Therefore, a sufficiently rich reward space ensures that the corresponding hypothesis space $\mathcal{H}$ includes all $|\mathcal{A}|^{|\mathcal{X}|}$ possible policies already alluded to. This trivially implies that $(\mathcal{X}, \mathcal{H})$ is not 1-neighborly.

Unfortunately, as also shown in the aforementioned work (Melo et al 2010), the consideration of $\mathcal{H}$ as the set of all possible policies also implies that all states must be sufficiently sampled, since no generalization across states is possible. This observation supports the option in most IRL research to focus on problems in which rewards/policies are selected from some restricted set (for example, Abbeel and Ng 2004; Ramachandran and Amir 2007; Neu and Szepesvari 2007; Syed and Schapire 2008). For the particular case of active learning approaches, the consideration of a full set of rewards/policies also implies that there is little hope that any active sampling will provide any but a negligible improvement in sample complexity. A related observation can be found in the work of Dasgupta (2005) in the context of active learning for binary classification.

¹ This number is even larger if multiple optimal actions are allowed.

Algorithm 2 GBS-IRL (version 2)

Require: MDP parameters $\mathcal{M} \backslash r$
Require: Reward space $\mathcal{R}_\rho$
Require: Prior distribution $p_0$ over $\mathcal{R}$
Compute $\mathcal{H}$ from $\mathcal{R}_\rho$
Determine partition $\Xi = \{\mathcal{X}_1, \ldots, \mathcal{X}_N\}$ of $\mathcal{X}$
Set $\mathcal{F}_0 = \emptyset$
for all $t = 0, \ldots$ do
  Sample $x_{t+1}$ from the set $\mathcal{X}_i$ that minimizes $W(p_t, [x]_i)$
  Obtain noisy response $a_{t+1}$
  Set $\mathcal{F}_{t+1} \leftarrow \mathcal{F}_t \cup \{(x_{t+1}, a_{t+1})\}$
  Update $p_{t+1}$ from $p_t$ using (6)
end for
return $\hat{h}_t = \operatorname{argmax}_{h \in \mathcal{H}} p_t(h)$.
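The loop of Algorithm 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each state is treated as its own partition set, every hypothesis is a deterministic policy (a list mapping state index to action index), $W(p, x)$ is taken to be $\max_a \sum_h p(h)\, h(x, a)$ with $h(x, a) \in \{-1, +1\}$, and the update is assumed to multiply $p(h)$ by $\hat\gamma$ when $h$ agrees with the observed action and by $\hat\beta$ otherwise. The names `weighted_vote`, `bayes_update`, `gbs_irl` and `expert` are hypothetical.

```python
def weighted_vote(p, policies, x, n_actions):
    """W(p, x) = max_a sum_h p(h) h(x, a), assuming h(x, a) = +1 when the
    hypothesis picks action a at state x and -1 otherwise."""
    share = [0.0] * n_actions
    for ph, policy in zip(p, policies):
        share[policy[x]] += ph
    total = sum(p)
    # sum_h p(h) h(x, a) = share[a] - (total - share[a])
    return max(2.0 * s - total for s in share)


def bayes_update(p, policies, x, a, gamma_hat, beta_hat):
    """Assumed multiplicative update: gamma_hat on agreement with (x, a),
    beta_hat on disagreement, followed by normalization."""
    q = [ph * (gamma_hat if policy[x] == a else beta_hat)
         for ph, policy in zip(p, policies)]
    z = sum(q)
    return [v / z for v in q]


def gbs_irl(policies, prior, n_actions, expert, steps,
            gamma_hat=0.9, beta_hat=0.1):
    """Query the most contested state, update the posterior, repeat."""
    p = list(prior)
    n_states = len(policies[0])
    for _ in range(steps):
        # The state with the smallest weighted vote is the one where the
        # hypotheses disagree the most under the current distribution.
        x = min(range(n_states),
                key=lambda s: weighted_vote(p, policies, s, n_actions))
        a = expert(x)  # possibly noisy response from the demonstrator
        p = bayes_update(p, policies, x, a, gamma_hat, beta_hat)
    best = max(range(len(p)), key=lambda h: p[h])
    return best, p
```

With four candidate policies over three states and a noiseless expert following the third policy, a call such as `gbs_irl(policies, [0.25] * 4, 2, lambda x: policies[2][x], steps=20)` concentrates the posterior on that policy.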



In situations where the 1-neighborly assumption may not be verified, Lemma 1 cannot be used to ensure the selection of highly informative queries once $W(p, [x]_i) > c^*$ for all $\mathcal{X}_i$. However, it should still be possible to use the main approach in GBS-IRL, as detailed in Algorithm 2. For this situation, we can specialize our sample complexity results in the following immediate corollary.

Corollary 2 (Convergence Rate of GBS-IRL, version 2). Let $\mathcal{H}$ denote our hypothesis space, defined over $\mathcal{X}$, and let $\hat\beta(x) \ge \alpha$ in the update (6). Then, for all $t$ such that $W(p_t, [x]_i) \le c^*$ for some $\mathcal{X}_i$,

$$\mathbb{P}[\hat{h}_t \ne h^*] \le |\mathcal{H}| (1 - \lambda)^t, \qquad t = 0, \ldots$$



where A = e 



l-c* 



and $\varepsilon$ is defined in (9).



Multiple Optimal Actions: 

In our presentation so far, we assumed that $\mathcal{R}_\rho$ is such that, for any $r \in \mathcal{R}_\rho$ and any $x \in \mathcal{X}$, $|\mathcal{A}_r(x)| = 1$ (Assumption 1). Informally, this corresponds to assuming that, for every reward function considered, there is a single optimal action, $\pi^*(x)$, at each $x \in \mathcal{X}$. This assumption has been considered, either explicitly or implicitly, in several previous works on learning by demonstration (see, for example, the work of Chernova and Veloso 2009). Closer to our own work on active IRL, several works recast IRL as a classification problem, focusing on deterministic policies $\pi_k : \mathcal{X} \to \mathcal{A}$ (Ng and Russell 2000; Cohn et al 2011; Judah et al 2011; Ross and Bagnell 2010; Ross et al 2011) and therefore, although not explicitly, also consider a single optimal action in each state.

However, MDPs with multiple optimal actions per state are not uncommon (the scenarios considered in Section 4, for example, have multiple optimal actions per state). In this situation, the properties of the resulting algorithm do not follow from our previous analysis, since the existence of multiple optimal actions necessarily requires a more general noise model. The immediate extension of our noise model to a scenario where multiple optimal actions are allowed poses several difficulties, as optimal actions across policies may be sampled with different probabilities.

In order to overcome such difficulty, we consider a more conservative Bayesian update that enables a seamless generalization of our results to scenarios that admit multiple optimal actions in each state. Our update now arises from considering that the likelihood of observing an action from a set $\mathcal{A}_h(x)$ at state $x$ is given by $\gamma_h(x)$. Equivalently, the likelihood of observing an action from $\mathcal{A} - \mathcal{A}_h(x)$ is given by $\beta_h(x) = 1 - \gamma_h(x)$. As before, $\gamma^*$ and $\beta^*$ correspond to the values of $\gamma_h$ and $\beta_h$ for the target hypothesis, and we again let

$$\alpha = \sup_{x \in \mathcal{X}} \beta^*(x).$$

Such aggregated noise model again enables the consideration of an approximate noise model that is constant across hypotheses, and is defined in terms of estimates $\hat\gamma(x)$ and $\hat\beta(x)$ of $\gamma^*(x)$ and $\beta^*(x)$. Given the noise model just described, we get the Bayesian update

$$\begin{aligned}
p_t(h) &\triangleq \mathbb{P}[h^* = h \mid \mathcal{F}_t] \\
&\propto \mathbb{P}[a_t \in \mathcal{A}_h \mid x_t, \mathcal{F}_{t-1}]\, \mathbb{P}[h^* = h \mid \mathcal{F}_{t-1}] \\
&= \mathbb{P}[a_t \in \mathcal{A}_h \mid x_t]\, \mathbb{P}[h = h^* \mid \mathcal{F}_{t-1}] \\
&\approx \hat\gamma(x_t)^{(1+h_t)/2}\, \hat\beta(x_t)^{(1-h_t)/2}\, p_{t-1}(h), \qquad (10)
\end{aligned}$$

with $\hat\gamma(x)$ and $\hat\beta(x)$ verifying (7). This revised formulation implies that the updates to $p_t$ are more conservative, in the sense that they are slower to "eliminate" hypotheses from $\mathcal{H}$. However, all results for Algorithm 1 remain valid with the new values for $\hat\gamma$ and $\hat\beta$.
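As an illustration, the conservative update (10) can be written as a short Python function. This is a sketch under assumptions: each hypothesis is represented by its sets of optimal actions per state (`optimal_sets[h][x]`, a hypothetical encoding), the noise estimates are passed as scalars for the queried state, and $h_t$ is taken to be $+1$ when the observed action belongs to $\mathcal{A}_h(x_t)$ and $-1$ otherwise.

```python
def conservative_update(p, optimal_sets, x, a, gamma_hat, beta_hat):
    """One step of the aggregated update: every hypothesis whose optimal
    set at x contains the observed action a is weighted by gamma_hat,
    all others by beta_hat; the result is normalized."""
    q = []
    for ph, sets in zip(p, optimal_sets):
        h_t = 1 if a in sets[x] else -1
        likelihood = (gamma_hat ** ((1 + h_t) / 2)) * (beta_hat ** ((1 - h_t) / 2))
        q.append(likelihood * ph)
    z = sum(q)
    return [v / z for v in q]
```

For two equally likely hypotheses, one with $\mathcal{A}_h(x) = \{0, 1\}$ and one with $\mathcal{A}_h(x) = \{2\}$, observing action 1 with $\hat\gamma = 0.9$ and $\hat\beta = 0.1$ moves the distribution to approximately $(0.9, 0.1)$. Any action in the first set would produce the same result, which is the sense in which the update is conservative.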

Unfortunately, by allowing multiple optimal actions per state, it is also much easier to find (non-degenerate) situations where $c^* = 1$, in which case our bounds are void. However, if we focus on identifying, in each state, at least one optimal action, we are able to retrieve some guarantees on the sample complexity of our active learning approach. We thus consider yet another version of GBS-IRL, described in Algorithm 3, that uses a threshold $c < 1$ such that, if $W(p_t, [x]_i) > c$, we consider that (at least) one optimal action at $[x]_i$ has been identified. Once at least one optimal action has been identified in all states, the algorithm stops and outputs the most likely hypothesis.



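Algorithm 3 GBS-IRL (version 3)

Require: MDP parameters $\mathcal{M} \backslash r$
Require: Reward space $\mathcal{R}_\rho$
Require: Prior distribution $p_0$ over $\mathcal{R}$
Compute $\mathcal{H}$ from $\mathcal{R}_\rho$
Determine partition $\Xi = \{\mathcal{X}_1, \ldots, \mathcal{X}_N\}$ of $\mathcal{X}$
Set $\mathcal{F}_0 = \emptyset$
for all $t = 0, \ldots$ do
  Set $c_t = \min_{i = 1, \ldots, N} W(p_t, [x]_i)$
  if $c_t < c$ then
    Sample $x_{t+1}$ from the set $\mathcal{X}_i$ that minimizes $W(p_t, [x]_i)$
  else
    Return $\hat{h}_t = \operatorname{argmax}_{h \in \mathcal{H}} p_t(h)$
  end if
  Obtain noisy response $a_{t+1}$
  Set $\mathcal{F}_{t+1} \leftarrow \mathcal{F}_t \cup \{(x_{t+1}, a_{t+1})\}$
  Update $p_{t+1}$ from $p_t$ using (10)
end for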



To analyze the performance of this version of GBS-IRL, let the set of predicted optimal actions at $x$ be defined as

$$\mathcal{A}_c(p, x) = \Big\{ a \in \mathcal{A} \;\Big|\; \sum_{h \in \mathcal{H}} p(h)\, h(x, a) > c \Big\}.$$
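The set of predicted optimal actions and the threshold test used by version 3 of GBS-IRL can be sketched as follows. The code is illustrative only and rests on assumptions: $h(x, a)$ is taken to be $+1$ when $a \in \mathcal{A}_h(x)$ and $-1$ otherwise, $W(p, x)$ is taken to be the largest score $\sum_h p(h)\, h(x, a)$ over actions, and each state forms its own partition set. The names `action_score`, `predicted_optimal_actions` and `next_query` are hypothetical.

```python
def action_score(p, optimal_sets, x, a):
    """sum_h p(h) h(x, a), with h(x, a) = +1 if a is in A_h(x), else -1."""
    return sum(ph * (1.0 if a in sets[x] else -1.0)
               for ph, sets in zip(p, optimal_sets))


def predicted_optimal_actions(p, optimal_sets, x, actions, c):
    """Actions whose weighted score at state x exceeds the threshold c."""
    return {a for a in actions if action_score(p, optimal_sets, x, a) > c}


def next_query(p, optimal_sets, n_states, actions, c):
    """Return the state with the smallest W(p, x), or None when every
    state already has W(p, x) >= c (the stopping condition)."""
    w = [max(action_score(p, optimal_sets, x, a) for a in actions)
         for x in range(n_states)]
    x_min = min(range(n_states), key=lambda x: w[x])
    return x_min if w[x_min] < c else None
```

When `next_query` returns `None`, the learner would stop and report the hypothesis with the largest posterior mass.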



We have the following results.

Theorem 3 (Consistency of GBS-IRL, version 3). Consider any history of observations $\mathcal{F}_t = \{(x_\tau, a_\tau), \tau = 1, \ldots, t\}$ from GBS-IRL. If in the update (10) $\hat\beta$ and $\hat\gamma$ verify (7) for all $h \in \mathcal{H}$, then for any $a \in \mathcal{A}_c(p, [x]_i)$,

$$\lim_{t \to \infty} \mathbb{P}[h^*([x]_i, a) \ne 1] = 0.$$

Proof. See Appendix A.5. □



Note that the above result is no longer formulated 
in terms of the identification of the correct hypothesis, 
but in terms of the identification of the set of opti- 
mal actions. We also have the following result on the 
sample complexity of version 3 of GBS-IRL. 

Corollary 3 (Convergence Rate of GBS-IRL, version 3). Let $\mathcal{H}$ denote our hypothesis space, defined over $\mathcal{X}$, and let $\hat\beta(x) \ge \alpha$ in the update (10). Then, for all $t$ such that $W(p_t, [x]_i) \le c^*$ for some $\mathcal{X}_i$, and all $a \in \mathcal{A}_c(p, [x]_i)$,

$$\mathbb{P}[h^*([x]_i, a) \ne 1] \le |\mathcal{H}| (1 - \lambda)^t, \qquad t = 0, \ldots$$



where A 



.l-c* 



and $\varepsilon$ is defined in (9) with the new values for $\hat\gamma$ and $\hat\beta$.

Different Query Types:

Finally, it is worth noting that the presentation so far admits queries such as "What is the optimal action in state $x$?" However, it is possible to devise different types of queries (such as "Is action $a$ optimal in state $x$?") that enable us to recover the stronger results in Theorem 2. In fact, a query such as the one exemplified reduces the IRL problem to a binary classification problem over $\mathcal{X} \times \mathcal{A}$, for which existing active learning methods such as the one of Nowak (2011) can readily be applied.

Integrating Reward Feedback: 

So far, we discussed one possible approach to IRL, where the agent is provided with a demonstration $\mathcal{F}_t = \{(x_\tau, a_\tau), \tau = 1, \ldots, t\}$ consisting of pairs $(x_\tau, a_\tau)$ of states and corresponding actions. From this demonstration the agent must identify the underlying target task, represented as a reward function, $r^*$. We now depart from the Bayesian formalism introduced above and describe how reward information can also be integrated.

With the addition of reward information, our demonstrations may now include state-reward pairs $(x_\tau, u_\tau)$, indicating that the reward in state $x_\tau$ takes the value $u_\tau$. This can be seen as a similar approach to those of Thomaz and Breazeal (2008) and Knox and Stone (2010) for reinforcement learning. The main difference is that, in the aforementioned works, actions are experienced by the learner who then receives rewards both from the environment and the teacher. Another related approach is introduced by Regan and Boutilier (2011), in the context of reward design for MDPs.

As with action information, the demonstrator would ideally provide exact values for $r^*$. However, we generally allow the demonstration to include some level of noise, with

$$\mathbb{P}[u_\tau = u \mid x_\tau, r_{\text{target}}] \propto e^{-(u - r_{\text{target}}(x_\tau))^2 / \sigma},$$

where $\sigma$ is a non-negative constant. As with policy information, reward information can be used to update $p_t(r_k)$ as

$$\begin{aligned}
p_t(r_k) &\triangleq \mathbb{P}[r^* = r_k \mid \mathcal{F}_t] \\
&\propto \mathbb{P}[u_t \mid x_t, r^* = r_k]\, \mathbb{P}[r^* = r_k \mid \mathcal{F}_{t-1}] \\
&\approx e^{-(u_t - r_k(x_t))^2 / \hat\sigma}\, p_{t-1}(r_k),
\end{aligned}$$

where, as before, we allow for an inaccurate estimate $\hat\sigma$ of $\sigma$ such that $\hat\sigma \ge \sigma$. Given the correspondence between the rewards in $\mathcal{R}_\rho$ and the hypotheses in $\mathcal{H}$, the above Bayesian update can be used to seamlessly integrate reward information in our Bayesian IRL setting.
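A reward-feedback update of this kind can be sketched in Python. The sketch assumes a likelihood proportional to $e^{-(u - r_k(x))^2 / \hat\sigma}$, following the noise model stated above with the estimate $\hat\sigma$ in place of $\sigma$; candidate rewards are stored as lists indexed by state, and the function name `reward_update` is hypothetical.

```python
import math


def reward_update(p, rewards, x, u, sigma_hat):
    """Reweight each candidate reward r_k by how well r_k(x) explains the
    observed reward value u, then normalize."""
    q = [pk * math.exp(-((u - r[x]) ** 2) / sigma_hat)
         for pk, r in zip(p, rewards)]
    z = sum(q)
    return [v / z for v in q]
```

A larger `sigma_hat` flattens the likelihood, so a single reward sample moves the distribution less; this mirrors the conservative role played by the noise estimates in the action-based update.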

To adapt our active learning approach to accommodate reward feedback, let

$$x_{t+1} = \operatorname*{argmin}_{\mathcal{X}_i,\, i = 1, \ldots, N} W(p_t, [x]_i),$$

i.e., $x_{t+1}$ is the state that would be queried by Algorithm 1 at time-step $t + 1$. If the user instead wishes to provide reward information, we would like to replace the query $x_{t+1}$ by some alternative query $x'_{t+1}$ that disambiguates as much as possible the actions in state $x_{t+1}$, much like a direct query to $x_{t+1}$ would.

To this purpose, we partition the space of rewards, $\mathcal{R}_\rho$, into $|\mathcal{A}|$ or fewer disjoint sets $\mathcal{R}_1, \ldots, \mathcal{R}_{|\mathcal{A}|}$, where each set $\mathcal{R}_a$ contains precisely those rewards $r \in \mathcal{R}_\rho$ for which $\pi_r(x_{t+1}) = a$. We then select the state $x'_{t+1} \in \mathcal{X}$, the reward at which best discriminates between the sets $\mathcal{R}_1, \ldots, \mathcal{R}_{|\mathcal{A}|}$. The algorithm will then query the demonstrator for the reward at this new state.

In many situations, the rewards in $\mathcal{R}_\rho$ allow only poor discrimination between the sets $\mathcal{R}_1, \ldots, \mathcal{R}_{|\mathcal{A}|}$. This is particularly evident if the reward is sparse, since after a couple of informative reward samples, all other states contain similar reward information. In Section 4 we illustrate this inconvenience, comparing the performance of our active method in the presence of both sparse and dense reward functions.

4 Experimental Results 

This section illustrates the application of GBS-IRL in 
several problems of different complexity. It also fea- 
tures a comparison with other existing methods from 
the active IRL literature. 

4.1 GBS-IRL 

In order to illustrate the applicability of our proposed 
approach, we conducted a series of experiments where 
GBS-IRL is used to determine the (unknown) reward 
function for some underlying MDP, given a perturbed 
demonstration of the corresponding policy. 

In each experiment, we illustrate and discuss the performance of GBS-IRL. The results presented correspond to averages over 200 independent Monte-Carlo trials, where each trial consists of a run of 100 learning steps, in each of which the algorithm is required to select one state to query and is provided the corresponding action. GBS-IRL is initialized with a set $\mathcal{R}_\rho$ of 500 independently generated random rewards. This set always includes the correct reward, $r^*$, and the remaining rewards are built to have similar range and sparsity as that of $r^*$.

The prior probabilities, $p_0(r)$, are proportional to the level of sparsity of each reward $r$. This implies that some of the random rewards in $\mathcal{R}_\rho$ may have larger prior probability than $r^*$. For simplicity, we considered an exact noise model, i.e., $\hat\beta = \beta^*$ and $\hat\gamma = \gamma^*$, where $\beta^*(x) = 0.1$ and $\gamma^*(x) = 0.9$, for all $x \in \mathcal{X}$.

For comparison purposes, we also evaluated the performance of other active IRL approaches from the literature, namely:

• The imitation query-by-committee algorithm (IQBC) of Judah et al (2011), which uses an entropy-based criterion to select the states to query.

• The expected myopic gain algorithm (EMG) of Cohn et al (2011), which uses a criterion based on the expected gain of information to select the states to query.

As pointed out in Section 1.1, IQBC is, in its core, very similar to GBS-IRL, the main differences being in terms of the selection criterion and of the fact that IQBC is able to accommodate the notion of "bad states". Since this notion is not used in our examples, we expect the performance of both methods to be essentially similar.

As for EMG, this algorithm queries the expert for the correct action in those states where the expected gain of information is potentially larger (Cohn et al 2011). This requires evaluating, for each state $x \in \mathcal{X}$ and each possible outcome, the associated gain of information. Such method is, therefore, fundamentally different from GBS-IRL and we expect this method to yield crisper differences from our own approach. Additionally, the above estimation is computationally heavy, as it requires (in the worst case) the evaluation of an MDP policy for each state-action pair.

Small-sized random MDPs 

In the first set of experiments, we evaluate the performance of GBS-IRL in several small-sized MDPs with no particular structure (both in terms of transitions and in terms of rewards). Specifically, we considered MDPs where $|\mathcal{X}| = 10$ and either $|\mathcal{A}| = 5$ or $|\mathcal{A}| = 10$. For each MDP size, we consider 10 random and independently generated MDPs, in each of which we conducted 200 independent learning trials. This first set of experiments serves two purposes. On one hand, it illustrates the applicability of GBS-IRL in arbitrary settings, by evaluating the performance of our method in random MDPs with no particular structure. On the other hand, these initial experiments also enable a quick comparative analysis of GBS-IRL against other relevant methods from the active IRL literature.

Figure 1: Performance of all methods in random MDPs with $|\mathcal{X}| = 10$ and $|\mathcal{A}| = 5$. Panels (a) and (b).

Figure 2: Performance of all methods in random MDPs with $|\mathcal{X}| = 10$ and $|\mathcal{A}| = 10$. Panels (a) and (b).



Figures 1(a) and 2(a) depict the learning curve for all three methods in terms of policy accuracy. The performance of all three methods is essentially similar in the early stages of the learning process. However, GBS-IRL slightly outperforms the other two methods, although the differences from IQBC are, as expected, smaller than those from EMG.

While policy accuracy gives a clear view of the learning performance of the algorithms, it conveys a less clear idea of the ability of the learned policies to complete the task intended by the demonstrator. To evaluate the performance of the three learning algorithms in terms of the target task, we also measured the loss of the learned policies with respect to the optimal policy. Results are depicted in Figs. 1(b) and 2(b). These results also confirm that the performance of GBS-IRL is essentially similar to that of the other methods. In particular, the differences observed in terms of policy accuracy have little impact in terms of the ability to perform the target task competently.

Figure 3: Average (total) computational time for problems of different dimensions. (Plot title: Comparative computational performance.)

Figure 4 panels: (a) Policy accuracy (50 × 5); (b) Policy accuracy (100 × 5).

To conclude this section, we also compare the computation time for all methods in these smaller problems. The results are depicted in Fig. 3. We emphasize that the results portrayed herein are only indicative, as all algorithms were implemented in a relatively straightforward manner, with no particular concerns for optimization. Still, the comparison does confirm that the computational complexity associated with EMG is many times greater than that involved in the remaining methods. This, as discussed earlier, is due to the heavy computations involved in the estimation of the expected myopic gain, which grows directly with the size of $|\mathcal{X}| \times |\mathcal{A}|$. This observation is also in line with the discussion already found in the original work of Cohn et al (2011).



Medium-sized random MDPs 

In the second set of experiments, we investigate how the performance of GBS-IRL is affected by the dimension of the domain considered. To this purpose, we evaluate the performance of GBS-IRL in arbitrary medium-sized MDPs with no particular structure (both in terms of transitions and in terms of rewards). Specifically, we now consider MDPs where either $|\mathcal{X}| = 50$ or $|\mathcal{X}| = 100$, and again take either $|\mathcal{A}| = 5$ or $|\mathcal{A}| = 10$. For each MDP size, we consider 10 random and independently generated MDPs, in each of which we conducted 200 independent learning trials.

Given the results in the first set of experiments and the computation time already associated with EMG, in the remaining experiments we opted to compare GBS-IRL with IQBC only. The learning curves in terms both of policy accuracy and task execution are depicted in Fig. 4.

In this set of experiments we can observe that the performance of IQBC appears to deteriorate more severely with the number of actions than that of GBS-IRL. Although not significantly, this tendency could already be observed in the smaller environments (see, for example, Fig. 2(b)). This dependence on the number of actions is not completely unexpected. In fact, IQBC queries states $x$ that maximize

$$VE(x) = -\sum_{a \in \mathcal{A}} \frac{n_{\mathcal{H}}(x, a)}{|\mathcal{H}|} \log \frac{n_{\mathcal{H}}(x, a)}{|\mathcal{H}|},$$

where $n_{\mathcal{H}}(x, a)$ is the number of hypotheses $h \in \mathcal{H}$ such that $a \in \mathcal{A}_h(x)$. Since the disagreement is taken over the set of all possible actions, there is some dependence of the performance of IQBC on the number of actions.

Figure 4 panels: (c) Policy accuracy (50 × 10); (d) Value perf. (50 × 5); (e) Value perf. (100 × 5); (f) Value perf. (50 × 10); (g) Policy accuracy (100 × 10); (h) Value perf. (100 × 10).

Figure 4: Classification and value performance of GBS-IRL and IQBC in medium-sized random MDPs. Solid lines correspond to GBS-IRL, and dotted lines correspond to IQBC. (a)-(c), (g) Classification performance. (d)-(f), (h) Value performance. The indicated values correspond to the dimensions $|\mathcal{X}| \times |\mathcal{A}|$ of the MDPs.

Figure 5: The puddle-world domain (Boyan and Moore 1995).

Figure 6: The trap-world domain (Judah et al 2011). (Figure label: Trap rooms.)
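The vote-entropy criterion can be sketched in Python to make its dependence on the number of actions concrete. This is an illustrative sketch, not the IQBC implementation: it assumes the usual committee normalization by $|\mathcal{H}|$, that each hypothesis is a deterministic policy (a list mapping state index to a single action index), and the function name `vote_entropy` is hypothetical.

```python
import math


def vote_entropy(policies, x, n_actions):
    """Entropy of the committee's votes at state x: each hypothesis votes
    for the action it considers optimal, and votes are normalized by the
    committee size."""
    counts = [0] * n_actions
    for policy in policies:
        counts[policy[x]] += 1
    n = len(policies)
    # Actions with zero votes contribute nothing to the entropy.
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)
```

The largest value this quantity can take is $\log |\mathcal{A}|$, reached when votes are spread evenly over all actions, so the scale of the criterion grows with the size of the action set.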

GBS-IRL, on the other hand, is more focused toward identifying one optimal action per state. This renders our approach less sensitive to the number of actions, as can be seen in Corollaries 1 through 3 and illustrated in Fig. 4.

Large-sized structured domains 

So far, we have analyzed the performance of GBS-IRL 
in random MDPs with no particular structure, both in 
terms of transition probabilities and reward function. 
In the third set of experiments, we look further into 
the scalability of GBS-IRL by considering large-sized 
domains. We consider more structured problems se- 
lected from the IRL literature. In particular, we eval- 
uate the performance of GBS-IRL in the trap-world, 
puddle-world and driver domains. 

The puddle-world domain was introduced in the work of Boyan and Moore (1995), and is depicted in Fig. 5. It consists of a 20 × 20 grid-world in which two "puddles" exist (corresponding to the darker cells). When in the puddle, the agent receives a penalty that is proportional to the squared distance to the nearest edge of the puddle, and ranges between 0 and −1. The agent must reach the goal state in the top-right corner of the environment, upon which it receives a reward of +1. We refer to the original description of Boyan and Moore (1995) for further details.

This domain can be described by an MDP with $|\mathcal{X}| = 400$ and $|\mathcal{A}| = 4$, where the four actions correspond to motion commands in the four possible directions. Transitions are stochastic, and can be described as follows. After selecting the action corresponding to moving in direction $d$, the agent will roll back one cell (i.e., move in the direction $-d$) with a probability 0.06. With a probability 0.24 the action will fail and the agent will remain in the same position. The agent will move to the adjacent position in direction $d$ with probability 0.4. With a probability 0.24 it will move two cells in direction $d$, and with probability 0.06 it will move three cells in direction $d$. We used a discount $\gamma = 0.95$ for the MDP (not to be confused with the noise parameters, $\gamma(x)$).

The trap-world domain was introduced in the work of Judah et al (2011), and is depicted in Fig. 6. It consists of a 30 × 30 grid-world separated into 9 rooms. Darker rooms correspond to trap rooms, from which the agent can only leave by reaching the corresponding bottom-left cell (marked with a "x"). Dark lines correspond to walls that the agent cannot traverse. Dotted lines are used to delimit the trap-rooms from the safe rooms but are otherwise meaningless. The agent must reach the goal state in the bottom-right corner of the environment. We refer to the work of Judah et al (2011) for a more detailed description.

This domain can be described by an MDP with $|\mathcal{X}| = 900$ and $|\mathcal{A}| = 4$, where the four actions correspond to motion commands in the four possible directions. Transitions are deterministic. The target reward function $r^*$ is 0 everywhere except on the goal, where $r^*(x_{\text{goal}}) = 1$. We again used a discount $\gamma = 0.95$ for the MDP.

Finally, the driver domain was introduced in the work of Abbeel and Ng (2004), an instance of which is depicted in Fig. 7. In this environment, the agent corresponds to the driver of the blue car at the bottom, moving at a speed greater than all other cars. All other cars move at constant speed and are scattered across the three central lanes. The goal of the agent is to drive as safely as possible, i.e., avoid crashing into other cars, turning too suddenly and, if possible, driving in the shoulder lanes.

Figure 7: The driver-world domain (Abbeel and Ng 2004).

Figure 8 panels: (a) Policy accur. (puddle-w.); (b) Policy accur. (trap-w.); (c) Policy accur. (driver); (d) Value perf. (puddle-w.); (e) Value perf. (trap-w.); (f) Value perf. (driver).

Figure 8: Classification and value performance of GBS-IRL and IQBC in the three large domains. Solid lines correspond to GBS-IRL, and dotted lines correspond to IQBC. (a)-(c) Classification performance. (d)-(f) Value performance.

For the purposes of our tests, we represented the driver domain as an MDP with $|\mathcal{X}| = 16{,}875$ and $|\mathcal{A}| = 5$, where the five actions correspond to driving the car into each of the 5 lanes. Transitions are deterministic. The target reward function $r^*$ penalizes the agent with a value of −10 for every crash, and with a value of −1 for driving in the shoulder lanes. Additionally, each lane change costs the agent a penalty of −0.1. As in the previous scenarios, we used a discount $\gamma = 0.95$ for the MDP.

As with the previous experiments, we conducted 200 independent learning trials for each of the three environments, and evaluated the performance of both GBS-IRL and IQBC. The results are depicted in Fig. 8.

Figure 9: The grid-world used to illustrate the combined use of action and reward feedback.

We can observe that, as in previous scenarios, the 
performance of both methods is very similar. All sce- 
narios feature a relatively small number of actions, 
which attenuates the negative dependence of IQBC on 
the number of actions observed in the previous exper- 
iments. 

It is also interesting to observe that the trap-world domain seems to be harder to learn than the other two domains, in spite of the differences in dimension. For example, while the driver domain required only around 10 samples for GBS-IRL to single out the correct hypothesis, the trap-world required around 20 to attain a similar performance. This may be due to the fact that the trap-world domain features the sparsest reward. Since the other rewards in the hypothesis space were selected to be similarly sparse, it is possible that many would lead to similar policies in large parts of the state-space, thus making the identification of the correct hypothesis harder.

To conclude, it is still interesting to observe that, in spite of the dimension of the problems considered, both methods were effectively able to single out the correct hypothesis after only a few samples. In fact, the overall performance is superior to that observed in the medium-sized domains, which indicates that the domain structure present in these scenarios greatly helps to disambiguate between hypotheses, given the expert demonstration.

4.2 Using Action and Reward Feedback 

To conclude the empirical validation of our approach, 
we conduct a final set of experiments that aims at illus- 
trating the applicability of our approach in the presence 
of both action and reward feedback. 

One first experiment illustrates the integration of 
both reward and policy information in the Bayesian 
IRL setting described in Section |3.3| We consider the 
simple 19 x 10 grid-world depicted in Fig. |9j where 
the agent must navigate to the top-right corner of the 
environment. In this first experiment, we use random
sampling, in which, at each time step t, the expert adds
one (randomly selected) sample to the demonstration
$\mathcal{F}_t$, which can be of either form $(x_t, a_t)$ or $(x_t, u_t)$.

Figure 10: Bayesian IRL using reward and action feedback
(plot title: "Performance of reward and policy feedback").

Figure 11: Active IRL using reward feedback: sparse
vs dense rewards. Panels (a) and (b).



Figure 10 compares the performance of Bayesian IRL
for demonstrations consisting of state-action pairs only,
state-reward pairs only, and both state-action and
state-reward pairs.

We first observe that all demonstration types enable 
the learner to slowly improve its performance in the 
target task. This indicates that all three sources of 
information (action, reward, and action+reward) give 
useful information to accurately identify the target task 
(or, equivalently, identify the target reward function). 

Another important observation is that a direct com- 
parison between the learning performance obtained 
with the different demonstration types may be mis- 
leading, since the ability of the agent to extract useful 
information from the reward samples greatly depends 
on the sparsity of the reward function. Except in those 
situations in which the reward is extremely informa- 
tive, an action-based demonstration will generally be 
more informative. 

In a second experiment, we analyze the performance
of our active learning method when querying only reward
information in the same grid-world environment.
In particular, we analyze the dependence of the performance
on the sparsity of the reward function, testing
GBS-IRL in two distinct conditions. The first condition,
depicted in Fig. 11(a), corresponds to a reward
function r* that is sparse, i.e., such that r*(x) = 0 for
all states x except the goal states, where r*(x_goal) = 1.

As discussed in Section 3.3, sparsity of rewards
greatly impacts the learning performance of our
Bayesian IRL approach. This phenomenon, however, is
not exclusive to the active learning approach: in fact,
as seen from Fig. 11(a), random sampling also exhibits
poor performance. It is still possible, nonetheless,
to detect some advantage in using an active sampling
approach.

In contrast, it is possible to design very informative
rewards by resorting to a technique known in the
reinforcement learning literature as reward shaping
(Ng et al., 1999). By considering a shaped version of
that same reward, we obtain the learning performance
depicted in Fig. 11(b). Note how, in the latter case,
convergence is extremely fast even in the presence of
random sampling.
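To make the contrast between the sparse and the shaped reward concrete, the following is a minimal sketch of potential-based shaping in the sense of Ng et al. (1999). The text does not say which potential was used in the experiments, so the potential below (negative Manhattan distance to the goal), the `(column, row)` state encoding, the goal cell `(18, 9)` and all function names are illustrative assumptions only.

```python
from typing import Tuple

State = Tuple[int, int]  # (column, row) grid coordinates, an assumed encoding

GOAL: State = (18, 9)    # assumed top-right cell of a 19 x 10 grid


def sparse_reward(x: State) -> float:
    """Sparse reward: 1 at the goal state, 0 everywhere else."""
    return 1.0 if x == GOAL else 0.0


def potential(x: State) -> float:
    """Hypothetical potential: negative Manhattan distance to the goal."""
    return -float(abs(x[0] - GOAL[0]) + abs(x[1] - GOAL[1]))


def shaped_reward(x: State, x_next: State, gamma: float = 0.95) -> float:
    """Potential-based shaping: r(x) + gamma * phi(x_next) - phi(x)."""
    return sparse_reward(x) + gamma * potential(x_next) - potential(x)
```

Under these assumptions, a step towards the goal receives a larger shaped reward than a step away from it in every state, which is one way to see why a reward sample can become informative far from the goal.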

We conclude by noting that, in the case of reward 
information, our setting is essentially equivalent to a 
standard reinforcement learning setting, for which ef- 
ficient exploration techniques have been proposed and 
may provide fruitful avenues for future research. 

5 Discussion 

In this paper we introduce GBS-IRL, a novel active IRL
algorithm that allows an agent to learn a task from a
demonstration by an "expert". Using a generalization
of binary search, our algorithm greedily queries the
expert for demonstrations in highly informative states.
As seen in Section 1.1, and following the designation
of Dasgupta (2011), GBS-IRL is an aggressive active
learning algorithm. Additionally, given our consideration
of noisy samples, GBS-IRL is naturally designed
to consider non-separable data. As pointed out by
Dasgupta (2011), few aggressive active learning algorithms
exist with provable complexity bounds for the
non-separable case. GBS-IRL comes with such guarantees,
summarized in Corollary 1: under suitable conditions
and for any given $\delta > 0$, $\mathbb{P}[h_t = h^*] \geq 1 - \delta$, as long as
t exceeds a threshold governed by a constant $\lambda$ that
does not depend on the dimension of the hypothesis space
but only on the sample noise.

Additionally, as briefly remarked in Section 3.2, it is
possible to use an adaptive submodularity argument
to establish the near-optimality of GBS-IRL. In fact,
given the target hypothesis, $h^*$, consider the objective
function

$$f(\mathcal{F}_t) = \mathbb{P}[h_t \neq h^* \mid \mathcal{F}_t] = 1 - p_t(h^*).$$

From Theorem 1 and its proof, it can be shown that $f$ is
strongly adaptive monotone and adaptive submodular,
so that the results of Golovin and Krause (2011) can be
used to provide a similar bound on the sample complexity
of GBS-IRL. To our knowledge, GBS-IRL is the first
active IRL algorithm with provable sample complexity
bounds. Additionally, as discussed in Section 3.2, our
reduction of IRL to a standard (multi-class) classification
problem implies that Algorithm 1 is not specialized in any
particular way to IRL problems. In particular, our results
are generally applicable to any multi-class classification
problem verifying the corresponding assumptions.

Finally, our main contributions focus on the
simplest form of interaction, in which the demonstration
consists of examples of the right action to take in
different situations. However, we also discuss how other
forms of expert feedback (beyond policy information)
may be integrated in a seamless manner in our GBS-IRL
framework. In particular, we discussed how to
combine both policy and reward information in our
learning algorithm. Our approach thus provides an
interesting bridge between reinforcement learning (or
learning by trial and error) and imitation learning (or
learning from demonstration). In particular, it brings
to the forefront existing results on efficient exploration
in reinforcement learning (Jaksch et al., 2010).

Additionally, the general Bayesian IRL framework
used in this paper is also amenable to the integration
of additional information sources. For example, the
human agent may provide trajectory information, or
indicate states that are frequently visited when following
the optimal path. From the MDP parameters it is
generally possible to associate a likelihood with such
feedback, which can in turn be integrated in the Bayesian
task estimation setting. However, extending the active
learning approach to such sources of information is less
straightforward and is left as an important avenue for
future research.

A Proofs

In this appendix we collect the proofs of all statements
throughout the paper.

A.1 Proof of Lemma 1

The method of proof is related to that of Nowak (2011).
We want to show that either

• $W(p, [x]_i) \leq c^*$ for some $\mathcal{X}_i \in \Xi$ or, alternatively,

• There are two $k$-neighbor sets $\mathcal{X}_i, \mathcal{X}_j \in \Xi$ such
that $W(p, [x]_i) > c^*$ and $W(p, [x]_j) > c^*$, while
$A^*(p, [x]_i) \neq A^*(p, [x]_j)$.

We have that, for any $a \in \mathcal{A}$,

N 

hen i=i hen 
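The above expression can be written equivalently as

$$\left| \mathbb{E}\left[ \sum_{h \in \mathcal{H}} p(h)\, h([x]_i, a) \right] \right| \leq c^*. \tag{12}$$

Suppose that there is no $x \in \mathcal{X}$ such that $W(p, x) \leq c^*$.
In other words, suppose that, for every $x \in \mathcal{X}$,
$W(p, x) > c^*$. Then, for (12) to hold, there must be
$\mathcal{X}_i, \mathcal{X}_j \in \Xi$ and $a \in \mathcal{A}$ such that

$$\sum_{h \in \mathcal{H}} p(h)\, h([x]_i, a) > c^*, \qquad \sum_{h \in \mathcal{H}} p(h)\, h([x]_j, a) < -c^*.$$

Since $(\mathcal{X}, \mathcal{H})$ is $k$-neighborly by assumption, there is
a sequence $\{\mathcal{X}_{k_1}, \ldots, \mathcal{X}_{k_\ell}\}$ such that $\mathcal{X}_{k_1} = \mathcal{X}_i$, $\mathcal{X}_{k_\ell} = \mathcal{X}_j$,
and every two sets $\mathcal{X}_{k_m}, \mathcal{X}_{k_{m+1}}$ are $k$-neighborly.
Additionally, at some point in this sequence, the sign
of $\sum_{h \in \mathcal{H}} p(h)\, h([x]_{k_m}, a)$ must change. This implies that
there are two $k$-neighboring sets $\mathcal{X}_{k_i}$ and $\mathcal{X}_{k_j}$ such that

$$\sum_{h \in \mathcal{H}} p(h)\, h([x]_{k_i}, a) > c^*, \qquad \sum_{h \in \mathcal{H}} p(h)\, h([x]_{k_j}, a) < -c^*,$$

which implies that

$$A^*(p, [x]_{k_i}) \neq A^*(p, [x]_{k_j}),$$

and the proof is complete.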

A.2 Proof of Theorem 1

Let $C_t$ denote the amount of probability mass placed
on incorrect hypotheses by $p_t$, i.e.,

$$C_t = \frac{1 - p_t(h^*)}{p_t(h^*)}.$$

The proof of Theorem 1 relies on the following
fundamental lemma, whose proof can be found in
Appendix A.3.

Lemma 2. Under the conditions of Theorem 1, the
process $\{C_t, t = 1, \ldots\}$ is a non-negative supermartingale
with respect to the filtration $\{\mathcal{F}_t, t = 1, \ldots\}$. In
other words,

$$\mathbb{E}[C_{t+1} \mid \mathcal{F}_t] \leq C_t,$$

for all $t \geq 0$.

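The proof now replicates the steps in the proof of
Theorem 3 of Nowak (2011). In order to keep the paper
as self-contained as possible, we repeat those steps
here. We have that

$$\mathbb{P}[h_t \neq h^*] \leq \mathbb{P}[p_t(h^*) < 1/2] = \mathbb{P}[C_t > 1] \leq \mathbb{E}[C_t],$$

where the last inequality follows from the Markov
inequality. Explicit computations yield

$$\begin{aligned}
\mathbb{E}[C_t]
&= \mathbb{E}\left[\frac{C_t}{C_{t-1}}\, C_{t-1}\right] \\
&= \mathbb{E}\left[\mathbb{E}\left[\frac{C_t}{C_{t-1}}\, C_{t-1} \,\Big|\, \mathcal{F}_{t-1}\right]\right] \\
&= \mathbb{E}\left[\mathbb{E}\left[\frac{C_t}{C_{t-1}} \,\Big|\, \mathcal{F}_{t-1}\right] C_{t-1}\right] \\
&\leq \max_{\mathcal{F}_{t-1}} \mathbb{E}\left[\frac{C_t}{C_{t-1}} \,\Big|\, \mathcal{F}_{t-1}\right] \mathbb{E}[C_{t-1}].
\end{aligned}$$

Finally, expanding the recursion,

$$\mathbb{P}[h_t \neq h^*] \leq C_0 \left( \max_{\tau = 1, \ldots, t-1} \mathbb{E}\left[\frac{C_\tau}{C_{\tau-1}} \,\Big|\, \mathcal{F}_{\tau-1}\right] \right)^t. \tag{13}$$

Since, from Lemma 2, $\mathbb{E}\left[C_t / C_{t-1} \mid \mathcal{F}_{t-1}\right] \leq 1$ for all $t$, the
conclusion follows.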



A.3 Proof of Lemma 2

The structure of the proof is similar to that of the proof
of Lemma 2 of Nowak (2011). We start by explicitly
writing the expression for the Bayesian update in (6).
For all $a \in \mathcal{A}$, let

$$\delta(a) \triangleq \mathbb{P}_{p_t}[h(x_{t+1}, a) = 1] = \frac{1}{2}\left[1 + \sum_{h \in \mathcal{H}} p_t(h)\, h(x_{t+1}, a)\right], \tag{14}$$

and we abusively write $\delta_{t+1}$ to denote $\delta(a_{t+1})$. The
quantity $\delta(a)$ corresponds to the fraction of probability
mass concentrated on hypotheses prescribing action $a$
as optimal in state $x_{t+1}$. The normalizing factor in the
update (6) is given by

$$\begin{aligned}
\sum_{h \in \mathcal{H}} p_t(h)\, & \gamma(x_{t+1})^{(1 + h_{t+1})/2}\, \beta(x_{t+1})^{(1 - h_{t+1})/2} \\
&= \sum_{h : h_{t+1} = 1} p_t(h)\, \gamma(x_{t+1}) + \sum_{h : h_{t+1} = -1} p_t(h)\, \beta(x_{t+1}) \\
&= \delta_{t+1}\, \gamma(x_{t+1}) + (1 - \delta_{t+1})\, \beta(x_{t+1}).
\end{aligned}$$

We can now write the Bayesian update of $p_t(h)$ as

$$p_{t+1}(h) = p_t(h)\, \frac{\gamma(x_{t+1})^{(1 + h_{t+1})/2}\, \beta(x_{t+1})^{(1 - h_{t+1})/2}}{\delta_{t+1}\, \gamma(x_{t+1}) + (1 - \delta_{t+1})\, \beta(x_{t+1})}. \tag{15}$$



Let

$$\eta(a) = \frac{\delta(a)\, \gamma(x_{t+1}) + (1 - \delta(a))\, \beta(x_{t+1})}{\gamma(x_{t+1})^{(1 + h^*(x_{t+1}, a))/2}\, \beta(x_{t+1})^{(1 - h^*(x_{t+1}, a))/2}}, \tag{16}$$

where, as with $\delta$, we abusively write $\eta_{t+1}$ to denote
$\eta(a_{t+1})$. Then, for $h^*$, we can now write the update
(15) simply as $p_{t+1}(h^*) = p_t(h^*) / \eta_{t+1}$, and

$$\frac{C_{t+1}}{C_t} = \frac{\left(1 - p_t(h^*)/\eta_{t+1}\right) \eta_{t+1}}{1 - p_t(h^*)} = \frac{\eta_{t+1} - p_t(h^*)}{1 - p_t(h^*)}.$$
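The update and the ratio above can be checked numerically. The sketch below is only an illustration under stated assumptions: hypotheses are represented by their labels h(x_{t+1}, a_{t+1}) in {-1, +1}, `gamma_x` and `beta_x` stand for the likelihood values γ(x_{t+1}) and β(x_{t+1}), and the function names are hypothetical rather than taken from the paper.

```python
from typing import List, Tuple


def bayes_update(p: List[float], h_next: List[int],
                 gamma_x: float, beta_x: float) -> Tuple[List[float], float]:
    """One Bayesian update step; returns the posterior and the normalizer."""
    # delta: prior mass on hypotheses that label the observed action as optimal
    delta = sum(pi for pi, hi in zip(p, h_next) if hi == 1)
    norm = delta * gamma_x + (1.0 - delta) * beta_x
    posterior = [pi * (gamma_x if hi == 1 else beta_x) / norm
                 for pi, hi in zip(p, h_next)]
    return posterior, norm


def eta(norm: float, h_star_next: int, gamma_x: float, beta_x: float) -> float:
    """Normalizer divided by the likelihood assigned under the target h*."""
    return norm / (gamma_x if h_star_next == 1 else beta_x)


def odds_against(p_star: float) -> float:
    """C_t = (1 - p_t(h*)) / p_t(h*)."""
    return (1.0 - p_star) / p_star
```

For instance, with prior `[0.5, 0.3, 0.2]`, labels `[1, -1, 1]`, `gamma_x = 0.8`, `beta_x = 0.1` and the first hypothesis taken as h*, the posterior mass on h* equals `0.5 / eta(...)`, and the ratio of `odds_against` after and before the update matches `(eta - 0.5) / (1 - 0.5)`.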

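The conclusion of the Lemma holds as long as
$\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t] \leq 1$. Conditioning the expectation
$\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t]$ on $x_{t+1}$, we have that

$$\begin{aligned}
\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1}]
&= \sum_{a \in \mathcal{A}} \eta(a)\, \mathbb{P}[a_{t+1} = a \mid \mathcal{F}_t, x_{t+1}] \\
&= \sum_{a \in \mathcal{A}} \eta(a)\, \gamma^*(x_{t+1})^{(1 + h^*(x_{t+1}, a))/2}\, \beta^*(x_{t+1})^{(1 - h^*(x_{t+1}, a))/2}.
\end{aligned}$$

Let $a^*$ denote the action in $\mathcal{A}$ such that $h^*(x_{t+1}, a^*) = 1$.
This leads to

$$\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1}] = \eta(a^*)\, \gamma^*(x_{t+1}) + \sum_{a \neq a^*} \eta(a)\, \beta^*(x_{t+1}). \tag{17}$$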

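For simplicity of notation, we temporarily drop the
explicit dependence of $\beta^*$, $\beta$, $\gamma^*$ and $\gamma$ on $x_{t+1}$. Explicit
computations now yield

$$\begin{aligned}
\eta(a^*)\gamma^* + \sum_{a \neq a^*} \eta(a)\beta^*
&= \left[\delta(a^*)\gamma + (1 - \delta(a^*))\beta\right] \frac{\gamma^*}{\gamma}
 + \sum_{a \neq a^*} \left[\delta(a)\gamma + (1 - \delta(a))\beta\right] \frac{\beta^*}{\beta} \\
&= \delta(a^*)\gamma^* + (1 - \delta(a^*)) \frac{\beta\gamma^*}{\gamma}
 + \sum_{a \neq a^*} \left[\delta(a) \frac{\gamma\beta^*}{\beta} + (1 - \delta(a))\beta^*\right].
\end{aligned}$$

Since $\sum_{a \neq a^*} \delta(a) = 1 - \delta(a^*)$,

$$\begin{aligned}
\eta(a^*)\gamma^* + \sum_{a \neq a^*} \eta(a)\beta^*
&= (1 - \delta(a^*)) \left[\frac{\beta\gamma^*}{\gamma} + \frac{\gamma\beta^*}{\beta}\right]
 + \delta(a^*)\gamma^* + \sum_{a \neq a^*} (1 - \delta(a))\beta^* \\
&= (1 - \delta(a^*)) \left[\frac{\beta\gamma^*}{\gamma} + \frac{\gamma\beta^*}{\beta}\right]
 + \delta(a^*)\gamma^* + (|\mathcal{A}| - 1)\beta^* - (1 - \delta(a^*))\beta^* \\
&= (1 - \delta(a^*)) \left[\frac{\beta\gamma^*}{\gamma} + \frac{\gamma\beta^*}{\beta}\right]
 + \delta(a^*)(\gamma^* + \beta^*) + 1 - \gamma^* - \beta^*,
\end{aligned}$$

where we have used the fact that $(|\mathcal{A}| - 1)\beta^* + \gamma^* = 1$.
Finally, we have

$$\begin{aligned}
\eta(a^*)\gamma^* + \sum_{a \neq a^*} \eta(a)\beta^*
&= 1 - (1 - \delta(a^*)) \left[\gamma^* + \beta^* - \frac{\beta\gamma^*}{\gamma} - \frac{\gamma\beta^*}{\beta}\right] \\
&= 1 - (1 - \delta(a^*)) \left[\gamma^* \frac{\gamma - \beta}{\gamma} + \beta^* \frac{\beta - \gamma}{\beta}\right].
\end{aligned}$$

Letting $\rho = 1 - (|\mathcal{A}| - 1)\alpha$, we have that
$\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1}] \leq 1$ as long as

$$\gamma^*(x) \frac{\gamma(x) - \beta(x)}{\gamma(x)} + \beta^*(x) \frac{\beta(x) - \gamma(x)}{\beta(x)} \geq 0$$

for all $x \in \mathcal{X}$. Since, for all $x \in \mathcal{X}$, $\beta^*(x) \leq \alpha$ and
$\gamma^*(x) \geq \rho$, we have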



= ( 7 *(x)-/3*(x)) 
> (p - a) 



/3(x) 
7*(g) 

70*0 /3(x) 
> 0. 



a 



7(^) /3(s) 



where the inequality is strict if ${x) > a for all x £ X. 
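As a sanity check on the algebra of this proof, one can compare the direct evaluation of the conditional expectation in (17) with the closed form it reduces to, namely 1 - (1 - δ(a*)) [γ*(γ - β)/γ + β*(β - γ)/β]. The sketch below assumes, as the derivation does, that the values δ(a) sum to one over the actions and that γ* + (|A| - 1)β* = 1; the function names and the numerical values in the usage note are illustrative assumptions.

```python
from typing import List


def expected_eta_direct(delta: List[float], a_star: int,
                        gamma: float, beta: float,
                        gamma_true: float, beta_true: float) -> float:
    """Direct evaluation of eta(a*) gamma* + sum over a != a* of eta(a) beta*."""
    total = 0.0
    for a, d in enumerate(delta):
        mix = d * gamma + (1.0 - d) * beta  # numerator of eta(a)
        if a == a_star:
            total += (mix / gamma) * gamma_true
        else:
            total += (mix / beta) * beta_true
    return total


def expected_eta_closed_form(delta: List[float], a_star: int,
                             gamma: float, beta: float,
                             gamma_true: float, beta_true: float) -> float:
    """Closed form: 1 - (1 - delta(a*)) * bracket."""
    bracket = (gamma_true * (gamma - beta) / gamma
               + beta_true * (beta - gamma) / beta)
    return 1.0 - (1.0 - delta[a_star]) * bracket
```

With three actions, `delta = [0.5, 0.3, 0.2]`, `a_star = 0`, assumed likelihoods `gamma = 0.6`, `beta = 0.2` and true values `gamma_true = 0.8`, `beta_true = 0.1`, both functions agree and the result is below one, as the supermartingale argument requires.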



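A.4 Proof of Theorem 2

To prove Theorem 2, we depart from (13):

$$\mathbb{P}[h_t \neq h^*] \leq C_0 \left( \max_{\tau = 1, \ldots, t-1} \mathbb{E}\left[\frac{C_\tau}{C_{\tau-1}} \,\Big|\, \mathcal{F}_{\tau-1}\right] \right)^t.$$

Letting

$$\lambda_t = \max_{\tau = 1, \ldots, t-1} \mathbb{E}\left[\frac{C_\tau}{C_{\tau-1}} \,\Big|\, \mathcal{F}_{\tau-1}\right],$$

the desired result can be obtained by bounding the
sequence $\{\lambda_t, t = 0, \ldots\}$ by some value $\lambda < 1$. To show
that such $\lambda$ exists, we consider separately the two
possible queries in Algorithm 1.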
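The reason a uniform bound λ < 1 is enough can be illustrated numerically: if every per-step ratio is bounded by λ, then (13) gives an error probability of at most C_0 λ^t, which falls below a target δ after a number of samples that is logarithmic in C_0 / δ. The helper below is a sketch under that assumption; the function name and the values in the usage note are hypothetical, and it does not reproduce the exact statement of the corollary.

```python
import math


def samples_needed(c0: float, lam: float, delta: float) -> int:
    """Smallest t such that c0 * lam**t <= delta, assuming 0 < lam < 1."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    if c0 <= 0.0 or delta <= 0.0:
        raise ValueError("c0 and delta must be positive")
    if delta >= c0:
        return 0
    return math.ceil(math.log(c0 / delta) / math.log(1.0 / lam))
```

As an example, a uniform prior over ten hypotheses gives C_0 = 0.9 / 0.1 = 9; with an assumed λ = 0.9 and δ = 0.05, the helper returns t = 50.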

Let then $c_t = \min_{i = 1, \ldots, N} W(p_t, [x]_i)$, and suppose
that there are no 1-neighbor sets $\mathcal{X}_i$ and $\mathcal{X}_j$ such that

$$W(p_t, [x]_i) > c_t, \qquad W(p_t, [x]_j) > c_t, \tag{18}$$

$$A^*(p_t, [x]_i) \neq A^*(p_t, [x]_j). \tag{19}$$

Then, from Algorithm 1, the queried state $x_{t+1}$ will be
such that

$$x_{t+1} \in \arg\min_{[x]_i} W(p_t, [x]_i).$$

Since, from the definition of $c^*$, $c_t \leq c^*$, it follows that
$\delta(a) \leq (1 + c^*)/2$ for all $a$, where $\delta(a)$ is defined in (14).
Then, from the proof of Lemma 2,

$$\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1}] \leq 1 - \varepsilon (1 - \delta(a^*)) \leq 1 - \varepsilon\, \frac{1 - c^*}{2},$$

where $a^*$ denotes the action in $\mathcal{A}$ such that $h^*(x, a^*) = 1$.

Consider now the case where there are 1-neighboring
sets $\mathcal{X}_i$ and $\mathcal{X}_j$ such that (18) holds. In this case,
according to Algorithm 1, $x_{t+1}$ is selected randomly as
either $[x]_i$ or $[x]_j$ with probability 1/2. Moreover, since
$\mathcal{X}_i$ and $\mathcal{X}_j$ are 1-neighbors, there is a single hypothesis,
say $h_0$, that prescribes different optimal actions in
$\mathcal{X}_i$ and $\mathcal{X}_j$. Let $a_i^*$ denote the optimal action at $[x]_i$,
and $a_j^*$ the optimal action at $[x]_j$, as prescribed by $h^*$.

Three situations are possible:

Situation 1. $A^*(p_t, [x]_i) \neq a_i^*$ and $A^*(p_t, [x]_j) = a_j^*$,
or $A^*(p_t, [x]_i) = a_i^*$ and $A^*(p_t, [x]_j) \neq a_j^*$;

Situation 2. $A^*(p_t, [x]_i) \neq a_i^*$ and $A^*(p_t, [x]_j) \neq a_j^*$;

Situation 3. $A^*(p_t, [x]_i) = a_i^*$ and $A^*(p_t, [x]_j) = a_j^*$.






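We consider Situation 1 first. From the proof of
Lemma 2,

$$\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1} \in \{[x]_i, [x]_j\}] \leq 1 - \frac{\varepsilon}{4} \left( 2 - \sum_{h \in \mathcal{H}} p_t(h) \left( h([x]_i, a_i^*) + h([x]_j, a_j^*) \right) \right),$$

where we explicitly replaced the definition of $\delta(a)$. If
$A^*(p_t, [x]_i) = a_i^*$ and $A^*(p_t, [x]_j) \neq a_j^*$ (the alternative
is treated similarly), we have that

$$\sum_{h \in \mathcal{H}} p_t(h)\, h([x]_i, a_i^*) \leq 1 \quad \text{and} \quad \sum_{h \in \mathcal{H}} p_t(h)\, h([x]_j, a_j^*) \leq 0,$$

yielding

$$\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1} \in \{[x]_i, [x]_j\}] \leq 1 - \frac{\varepsilon}{4}.$$

Considering Situation 2, we again have

$$\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1} \in \{[x]_i, [x]_j\}] \leq 1 - \frac{\varepsilon}{4} \left( 2 - \sum_{h \in \mathcal{H}} p_t(h) \left( h([x]_i, a_i^*) + h([x]_j, a_j^*) \right) \right),$$

where, now,

$$\sum_{h \in \mathcal{H}} p_t(h)\, h([x]_i, a_i^*) \leq 0 \quad \text{and} \quad \sum_{h \in \mathcal{H}} p_t(h)\, h([x]_j, a_j^*) \leq 0.$$

This immediately implies

$$\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1} \in \{[x]_i, [x]_j\}] \leq 1 - \frac{\varepsilon}{2}.$$

Finally, concerning Situation 3, $h_0 = h^*$. Since $\mathcal{X}_i$
and $\mathcal{X}_j$ are 1-neighbors, $h([x]_i, a_i^*) = h([x]_j, a_i^*)$ for all
hypotheses other than $h^*$. Equivalently, $h([x]_i, a_i^*) =
-h([x]_j, a_j^*)$ for all hypotheses other than $h^*$. This
implies that

$$\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t, x_{t+1} \in \{[x]_i, [x]_j\}] \leq 1 - \frac{\varepsilon}{2} (1 - p_t(h^*)).$$

Putting everything together,

$$\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t] \leq \max\left\{ 1 - \frac{\varepsilon}{4},\; 1 - \frac{\varepsilon}{2}(1 - p_t(h^*)),\; 1 - \frac{\varepsilon}{2}(1 - c^*) \right\}$$

and

$$\mathbb{E}\left[\frac{C_{t+1}}{C_t} \,\Big|\, \mathcal{F}_t\right] \leq \frac{\mathbb{E}[\eta_{t+1} \mid \mathcal{F}_t] - p_t(h^*)}{1 - p_t(h^*)} \leq 1 - \min\left\{ \frac{\varepsilon}{4},\; \frac{\varepsilon}{2}(1 - c^*) \right\}.$$

The proof is complete. □

A.5 Proof of Theorem 3

Let $\varepsilon = 1 - c$ and

$$C_t = \frac{\varepsilon - p_t(h^*)}{p_t(h^*)}.$$

Let $a$ denote an arbitrary action in $A_c(p_t, [x]_i)$, for
some $[x]_i$, $i = 1, \ldots, N$. Then

$$\begin{aligned}
\mathbb{P}[h^*([x]_i, a) = -1]
&\leq \mathbb{P}\left[ \sum_{h \neq h^*} p_t(h)\, h([x]_i, a) > c + p_t(h^*) \right] \\
&\leq \mathbb{P}\left[ \sum_{h \neq h^*} p_t(h) > c + p_t(h^*) \right] \\
&= \mathbb{P}\left[ 1 - p_t(h^*) > c + p_t(h^*) \right] \\
&= \mathbb{P}[C_t > 1] \\
&\leq \mathbb{E}[C_t],
\end{aligned}$$

where, again, the last inequality follows from the
Markov inequality. We can now replicate the steps in
the proof of Theorem 1 in Appendix A.2 to establish
the desired result, for which we need only to prove that

$$\mathbb{E}[C_{t+1} \mid \mathcal{F}_t] \leq C_t.$$

From Lemma 2, the result follows. □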